Yi Liu1,2, Xintao Xu1,3, Bajian Xiang1,2, Gang Chen1,2, Guoliang Gong1,2, Huaxiang Lu1,2,4,5,6.
Abstract
Depth estimation algorithms based on convolutional neural networks that compute disparity by constructing a matching cost volume have several limitations. First, because the disparity range is fixed in advance, true disparities beyond the predetermined range cannot be recovered. Second, the matching process lacks constraints on occlusion and matching uniqueness. Third, as a local feature extractor, a convolutional neural network has limited ability to perceive global context. To address these problems, we propose a Transformer-based disparity prediction algorithm comprising the Swin-SPP module for feature extraction based on the Swin Transformer, a Transformer disparity matching network based on self-attention and cross-attention mechanisms, and an occlusion prediction sub-network. In addition, we propose a double skip connection fully connected layer to address vanishing and exploding gradients during training of the Transformer model, which further improves inference accuracy. The proposed model achieves an EPE (end-point error, the mean absolute disparity error) of 0.57 and 0.61 and a 3PE (percentage of pixels with an error greater than 3 px) of 1.74% and 1.56% on the KITTI 2012 and KITTI 2015 datasets, respectively, with an inference time of 0.46 s and only 2.6 M parameters, comparing favorably with other algorithms across the evaluation metrics.
Keywords: attention; binocular disparity; transformer
Year: 2022; PMID: 36236675; PMCID: PMC9570544; DOI: 10.3390/s22197577
Source DB: PubMed; Journal: Sensors (Basel); ISSN: 1424-8220; Impact factor: 3.847
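The first limitation named in the abstract, a fixed disparity search range, can be illustrated with a toy example. The following is a minimal sketch, not the paper's method: it runs winner-take-all matching on a single scanline of scalar "features", and every name in it (`wta_disparity`, `max_disp`, the sample values) is hypothetical.

```python
from typing import Sequence


def wta_disparity(left: Sequence[float], right: Sequence[float],
                  x: int, max_disp: int) -> int:
    """Winner-take-all disparity for left pixel x on one scanline.

    Searches only d in [0, max_disp), mimicking one column of a cost
    volume with a predetermined disparity range (illustrative assumption).
    """
    best_d, best_cost = 0, float("inf")
    for d in range(min(max_disp, x + 1)):  # keep x - d inside the row
        cost = abs(left[x] - right[x - d])  # absolute-difference matching cost
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d


TRUE_DISP = 6
right = [5, 12, 3, 9, 0, 7, 14, 2, 11, 6, 13, 1]
# Left row: the right row shifted by TRUE_DISP, padded with a filler value.
left = [99] * TRUE_DISP + right[:-TRUE_DISP]

wide = wta_disparity(left, right, x=10, max_disp=8)    # search range includes 6
narrow = wta_disparity(left, right, x=10, max_disp=4)  # search range ends at 3
print(wide, narrow)
```

With the wider range the search recovers the true disparity of 6. With `max_disp=4` it can only return the best candidate inside the range, which is necessarily wrong for this pixel.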
Figure 1. Disparity algorithm framework based on Transformer.
Figure 2. Traditional Transformer structure.
Figure 3. A Transformer structure containing double skip connection.
Figure 4. Common skip connection structures.
Figure 5. Skip connection structures in Transformer.
Figure 6. Fully connected double skip connection in Transformer.
Figure 7. Transformer based on attention.
Figure 8. The row and column normalization of attention weights.
Figure 9. Occlusion area calculation.
Figure 10. Adaptive sub-network.
The software and hardware platform required for the Transformer algorithm.
| Platform | Parameters |
|---|---|
| CPU | Intel Core i7-10700 @2.9 GHz |
| GPU | NVIDIA RTX 2080 Ti |
| Memory | DDR4 3200 MHz, 16 GB |
| Operating System | Ubuntu 18.04 LTS |
| Deep Learning Framework | PyTorch 1.7.1 |
Figure 11. Intersection and union of predicted bounding box and the real one.
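The intersection and union referred to in Figure 11 combine into the intersection-over-union (IoU) ratio. The following is a minimal sketch applied to binary occlusion masks, under the assumption that the Occlusion columns in the result tables report this ratio as a percentage; the source does not state this explicitly, and the function name and sample masks are hypothetical.

```python
from typing import Sequence


def occlusion_iou(pred_mask: Sequence[bool], gt_mask: Sequence[bool]) -> float:
    """Intersection over union of two flattened binary occlusion masks."""
    if len(pred_mask) != len(gt_mask):
        raise ValueError("masks must have the same length")
    inter = sum(1 for p, g in zip(pred_mask, gt_mask) if p and g)
    union = sum(1 for p, g in zip(pred_mask, gt_mask) if p or g)
    if union == 0:
        # Neither mask marks any pixel: treated as perfect agreement
        # (a convention chosen for this sketch, not taken from the source).
        return 1.0
    return inter / union


pred_mask = [True, True, False, False, True]
gt_mask = [True, False, False, False, True]
iou = occlusion_iou(pred_mask, gt_mask)
print(f"IoU = {100.0 * iou:.2f}%")
```

Here two pixels are marked in both masks and three in at least one, so the ratio is 2/3.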
Figure 12. Elements in the DispNet/FlowNet2.0 dataset.
Figure 13. Splitting the default training and testing sets of DispNet/FlowNet2.0 for pre-training.
Figure 14. The comparison between our algorithm and other deep learning-based algorithms.
Quantitative evaluation on the DispNet/FlowNet2.0 dataset.
| Method | EPE | 3PE (%) | Occlusion | Parameters | Runtime |
|---|---|---|---|---|---|
| PSMNet | 1.25 | 3.31 | — | 5.2 M | 0.59 s |
| AnyNet | 3.19 | — | — | 40,000 | 97.3 ms |
| DeepPruner | 0.86 | 2.13 | — | N/A | 182 ms |
| AANet | 0.87 | — | — | N/A | 62 ms |
| GC-Net | 2.51 | 9.34 | — | 3.5 M | 0.95 s |
| Ours | 0.47 | 1.41 | 98.04 | 2.6 M | 0.46 s |
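
The EPE and 3PE columns follow the definitions given in the abstract: the mean absolute disparity error and the percentage of pixels whose error exceeds 3 px. The following is a minimal sketch of both metrics on flattened disparity maps. The function name, the optional validity mask, and the sample values are assumptions for illustration, and the sketch uses a purely absolute threshold, which may differ from the exact evaluation protocol used in the paper.

```python
from typing import Optional, Sequence, Tuple


def disparity_metrics(
    pred: Sequence[float],
    gt: Sequence[float],
    valid: Optional[Sequence[bool]] = None,
    threshold_px: float = 3.0,
) -> Tuple[float, float]:
    """Return (EPE in px, 3PE in percent) over the valid pixels.

    pred, gt: flattened disparity maps in pixels (hypothetical inputs).
    valid: optional mask; pixels without ground truth are skipped.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")
    if valid is None:
        valid = [True] * len(gt)
    errors = [abs(p - g) for p, g, v in zip(pred, gt, valid) if v]
    if not errors:
        raise ValueError("no valid pixels to evaluate")
    epe = sum(errors) / len(errors)                    # mean absolute error
    bad = sum(1 for e in errors if e > threshold_px)   # strictly above threshold
    return epe, 100.0 * bad / len(errors)


pred = [10.0, 20.5, 33.0, 47.0]
gt = [10.0, 20.0, 30.0, 40.0]
epe, pe3 = disparity_metrics(pred, gt)
print(f"EPE = {epe:.3f} px, 3PE = {pe3:.1f}%")
```

The per-pixel errors are 0, 0.5, 3 and 7 px, so the EPE is 2.625 px; only the last pixel exceeds 3 px, giving a 3PE of 25%.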
Figure 15. The left and right stereo images of a natural scene in the KITTI dataset.
Figure 16. Partial experimental results on the KITTI dataset using our algorithm.
Quantitative evaluation on the KITTI 2012 dataset.
| Method | EPE | 3PE (%) | Occlusion | Parameters | Runtime |
|---|---|---|---|---|---|
| PSMNet | 0.6 | 1.89 | — | 5.2 M | 0.50 s |
| AnyNet | — | 6.10 | — | 40,000 | 97.3 ms |
| DeepPruner | — | 2.03 | — | N/A | 180 ms |
| AcfNet | 0.58 | 1.78 | — | 5.6 M | 0.48 s |
| GC-Net | 0.70 | 2.30 | — | 3.5 M | 0.9 s |
| Ours | 0.57 | 1.74 | 98.80 | 2.6 M | 0.46 s |
Quantitative evaluation on the KITTI 2015 dataset.
| Method | EPE | 3PE (%) | Occlusion | Parameters | Runtime |
|---|---|---|---|---|---|
| PSMNet | — | 2.33 | — | 5.2 M | 0.50 s |
| AnyNet | — | 6.20 | — | 40,000 | 97.3 ms |
| DeepPruner | — | 2.15 | — | N/A | 180 ms |
| AcfNet | — | 1.89 | — | 5.6 M | 0.48 s |
| GC-Net | — | 2.87 | — | 3.5 M | 0.9 s |
| Ours | 0.6869 | 2.04 | 99.86 | 2.6 M | 0.46 s |
| Ours* | 0.6098 | 1.56 | 99.87 | 2.6 M | 0.46 s |
Generalization performance test.
| Method | Middlebury EPE | Middlebury 3 px error | Middlebury Occlusion | KITTI EPE | KITTI 3 px error | KITTI Occlusion |
|---|---|---|---|---|---|---|
| PSMNet | 3.05 | 12.96 | — | 6.56 | 27.79 | — |
| GwcNet | 1.89 | 8.59 | — | 2.21 | 12.60 | — |
| AANet | 2.19 | 12.80 | — | 1.99 | 12.42 | — |
| Ours | 2.23 | 6.09 | 95.5% | 1.40 | 5.74 | 98.7% |