Patrick Brandao, Evangelos Mazomenos, Danail Stoyanov.
Abstract
Computational stereo is one of the classical problems in computer vision. Numerous algorithms and solutions have been reported in recent years, focusing on developing methods for computing similarity, aggregating it to obtain spatial support, and finally optimizing an energy function to find the final disparity. In this paper, we focus on the feature extraction component of the stereo matching architecture and show that standard CNN operations can be used to improve the quality of the features used to find point correspondences. Furthermore, we use a simple spatial aggregation that greatly simplifies the correlation learning problem, allowing us to better evaluate the quality of the extracted features. Our results on benchmark data are compelling and show promising potential even without refining the solution.
Keywords: 62M45; 62P30; Computer vision; Convolutional neural network; Disparity; Stereo matching
Year: 2019 PMID: 31007321 PMCID: PMC6472548 DOI: 10.1016/j.patrec.2018.12.002
Source DB: PubMed Journal: Pattern Recognit Lett ISSN: 0167-8655 Impact factor: 3.756
Fig. 1 Representation of our 7-layered stereo matching CNN. Patches extracted from the left and right stereo images are processed in the blue and orange branches, respectively. During training, the width of the right patch depends on the max disparity (D) considered. After feature extraction with the siamese architecture, the features are aggregated according to their relative displacement. The correlation between features for each disparity is computed by a simple two-layer correlation architecture. The final disparity volume represents a correlation value of each possible integer disparity between zero and D for every left patch pixel.
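The aggregation step in the caption can be sketched in plain numpy: for each candidate disparity d, left-branch features are compared against right-branch features shifted by d, yielding one matching score per pixel and disparity. The sketch below uses a simple inner product as the matching function (the "inner prod" baseline); the paper's learned two-layer correlation would replace that dot product. Function name and tensor shapes are illustrative, not the authors' code.

```python
import numpy as np

def disparity_volume(feat_left, feat_right, max_disp):
    """Inner-product matching cost for every integer disparity in [0, max_disp].

    feat_left, feat_right: (H, W, C) feature maps from the two siamese branches.
    Returns an (H, W, max_disp + 1) volume of correlation scores.
    """
    H, W, C = feat_left.shape
    volume = np.zeros((H, W, max_disp + 1), dtype=feat_left.dtype)
    for d in range(max_disp + 1):
        # A left pixel at column x matches the right pixel at column x - d;
        # columns x < d have no valid match and keep a score of zero.
        volume[:, d:, d] = np.sum(
            feat_left[:, d:, :] * feat_right[:, : W - d, :], axis=-1
        )
    return volume

# Tiny example with random features; at disparity 0 the score is
# simply the squared norm of the (identical) feature vectors.
rng = np.random.default_rng(0)
f = rng.normal(size=(4, 8, 16))
vol = disparity_volume(f, f, max_disp=3)
print(vol.shape)  # (4, 8, 4)
```

In the paper's learned variant, the per-disparity aggregated feature pairs are instead fed to a small two-layer network that outputs the correlation score, which is what the "learned" rows in the tables below evaluate.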
Comparison of several error metrics (in %) for our three siamese architectures, trained with the inner product (inner prod) and with our correlation architecture (learned), on the KITTI 2015 validation set.
| Siamese CNN | Correlation | Non-Occ (>2 px) | All (>2 px) | Non-Occ (>3 px) | All (>3 px) | Non-Occ (>5 px) | All (>5 px) | Runtime (s) |
|---|---|---|---|---|---|---|---|---|
| | Inner prod | 11.19 | 12.68 | 10.01 | 11.50 | 8.57 | 10.05 | 1.15 |
| | Learned | 8.26 | 10.72 | 7.10 | 9.71 | 6.82 | 8.40 | 5.25 |
| | Inner prod | 7.80 | 9.36 | 6.81 | 8.37 | 5.75 | 7.30 | 1.15 |
| | Learned | – | – | – | – | – | – | 5.27 |
| | Inner prod | 6.89 | 8.47 | 6.02 | 7.61 | 5.18 | 6.74 | 1.16 |
| | Learned | 7.47 | 8.96 | 6.42 | 7.88 | 5.41 | 6.82 | 5.28 |
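The >N pixel columns above report the percentage of pixels whose predicted disparity deviates from ground truth by more than N pixels, computed either over non-occluded pixels (Non-Occ) or over all pixels with ground truth (All). A minimal numpy sketch of this standard KITTI-style metric (function name and mask handling are illustrative, not the authors' evaluation code):

```python
import numpy as np

def pixel_error(pred, gt, threshold, valid=None):
    """Percentage of valid pixels whose disparity error exceeds `threshold`.

    `valid` is a boolean mask selecting the pixels to score, e.g. the
    non-occluded pixels for a Non-Occ column or every pixel with ground
    truth for an All column.
    """
    if valid is None:
        valid = np.ones_like(gt, dtype=bool)
    err = np.abs(pred - gt)
    return 100.0 * np.mean(err[valid] > threshold)

# 1 of 4 pixels is off by more than 2 pixels -> 25 %.
gt = np.array([[10.0, 20.0], [30.0, 40.0]])
pred = np.array([[10.5, 23.5], [30.0, 41.0]])
print(pixel_error(pred, gt, threshold=2))  # 25.0
```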
Comparison of several error metrics (in %) for our three siamese architectures, trained with the inner product (inner prod) and with our correlation architecture (learned), on the KITTI 2012 validation set.
| Siamese CNN | Correlation | Non-Occ (>2 px) | All (>2 px) | Non-Occ (>3 px) | All (>3 px) | Non-Occ (>5 px) | All (>5 px) | Runtime (s) |
|---|---|---|---|---|---|---|---|---|
| | Inner prod | 12.42 | 14.18 | 11.38 | 13.16 | 9.98 | 11.76 | 1.15 |
| | Learned | 11.27 | 13.05 | 10.39 | 12.13 | 9.08 | 10.82 | 5.25 |
| | Inner prod | 7.57 | 9.45 | 6.72 | 8.61 | 5.64 | 7.53 | 1.15 |
| | Learned | – | – | – | – | – | – | 5.27 |
| | Inner prod | 7.47 | 9.34 | 6.50 | 8.36 | 5.31 | 7.17 | 1.16 |
| | Learned | 7.57 | 10.29 | 6.59 | 9.05 | 5.34 | 7.80 | 5.28 |
Fig. 3 Examples of non-regularized disparities (middle) and errors (right) of KITTI 2012 validation images (left), computed with the S7 architecture and learned correlation.
Fig. 4 Examples of non-regularized disparities (middle) and errors (right) of KITTI 2015 validation images (left), computed with the S7 architecture and learned correlation.
Comparison of the >2 pixel error (%) of different siamese matching architectures, without post-processing, on the KITTI 2012 and 2015 validation sets.
| Method | KITTI 2012 Non-Occ | KITTI 2012 All | KITTI 2015 Non-Occ | KITTI 2015 All |
|---|---|---|---|---|
| MC-CNN-acrt | 15.02 | 16.92 | 15.20 | 16.83 |
| MC-CNN-fast | 17.72 | 19.56 | 18.47 | 20.04 |
| Luo et al. | 10.87 | 12.86 | 9.96 | 11.67 |
| Ours | 7.57 | 10.29 | 6.89 | 8.47 |