Lina Liu, Yaqiu Liu, Yunlei Lv, Jian Xing.
Abstract
The 3D reconstruction of forests provides a strong basis for the scientific regulation of tree growth and the fine-grained survey of forest resources. Depth estimation is the key to the 3D reconstruction of inter-forest scenes and directly determines the quality of the digital stereo reproduction. Existing stereo matching methods cannot use environmental information to establish consistency in ill-posed regions, so they match poorly in regions with weak texture, occlusion, and other inconspicuous features. To address this, LANet, a stereo matching network based on a Linear-Attention mechanism, is proposed. It improves stereo matching accuracy by effectively using the global and local information of the environment, thereby improving depth estimation. An attention module (AM), consisting of a spatial attention module (SAM) and a channel attention module (CAM), is designed to model the semantic relevance of inter-forest scenes along the spatial and channel dimensions. The linear-attention mechanism proposed in SAM reduces the overall complexity of Self-Attention from O(n²) to O(n) and selectively aggregates the features at each position through a weighted sum over all positions, so that it learns rich contextual relations and captures long-range dependencies. The Self-Attention mechanism used in CAM selectively emphasizes interdependent channel maps by learning the associated features between different channels. A 3D CNN module is optimized to adjust the matching cost volume by combining multiple stacked hourglass networks with intermediate supervision, which further improves the speed of the model while reducing the computational cost of inference. On the SceneFlow dataset, LANet achieves an EPE of 0.82 and a three-pixel error of 2.31%; on the Forest dataset, it achieves an EPE of 0.68 and a D1-all of 2.15%. Both results outperform several state-of-the-art methods, and the overall performance is very competitive.
LANet can obtain high-precision disparity values of the inter-forest scene, which can be converted to obtain depth information, thus providing key data for high-quality 3D reconstruction of the forest.
Keywords: depth estimation; forestry 3D reconstruction; linear-attention; self-attention; stacked hourglasses; stereo match
Year: 2022 PMID: 36119611 PMCID: PMC9478843 DOI: 10.3389/fpls.2022.978564
Source DB: PubMed Journal: Front Plant Sci ISSN: 1664-462X Impact factor: 6.627
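The abstract notes that the predicted disparity can be converted to depth. For a rectified stereo pair this is usually done with the relation depth = focal length × baseline / disparity. The sketch below illustrates that relation only; the function name and the camera parameters are hypothetical and are not taken from the paper.

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Convert a disparity (pixels) to depth (meters) for a rectified stereo pair."""
    if disparity_px <= 0:
        # Zero disparity corresponds to a point at infinity.
        return float("inf")
    return focal_px * baseline_m / disparity_px


# Hypothetical camera parameters, chosen only for illustration.
depth = disparity_to_depth(disparity_px=50.0, focal_px=1000.0, baseline_m=0.5)
print(depth)  # 10.0
```

Because depth is inversely proportional to disparity, a fixed disparity error costs more depth accuracy for distant points than for near ones.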
Figure 1. Architecture overview of the proposed LANet.
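The channel attention (CAM) described in the abstract applies self-attention across channels rather than across positions. The following is a minimal pure-Python sketch of that general idea, not the authors' implementation. The residual scale `gamma`, the plain row-wise softmax, and all helper names are assumptions made for illustration.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def channel_attention(x, gamma=1.0):
    """x holds C rows (channels), each with N flattened spatial positions."""
    x_t = [list(col) for col in zip(*x)]         # N x C
    energy = matmul(x, x_t)                      # C x C channel affinities
    attention = [softmax(r) for r in energy]     # each row sums to 1
    mixed = matmul(attention, x)                 # C x N reweighted channel maps
    # Residual connection; gamma is an assumed (normally learned) scale.
    return [[gamma * m + v for m, v in zip(mr, xr)] for mr, xr in zip(mixed, x)]


x = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]           # toy input: C = 2, N = 3
out = channel_attention(x)
print(len(out), len(out[0]))  # 2 3
```

The affinity matrix is only C × C, so its cost depends on the number of channels rather than on the image size.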
Layers and parameter settings of the proposed LANet.
| Module | Layer | Setting | Output size |
|---|---|---|---|
| ResNet | conv0_1 | 3 × 3, 32, S2 | 1/2H × 1/2W × 32 |
| | conv0_2 | 3 × 3, 32, S1 | 1/2H × 1/2W × 32 |
| | conv0_3 | 3 × 3, 32, S1 | 1/2H × 1/2W × 32 |
| | conv1_x | | 1/2H × 1/2W × 32 |
| | conv2_x | | 1/4H × 1/4W × 64 |
| | conv3_x | | 1/4H × 1/4W × 128 |
| | conv4_x | | 1/4H × 1/4W × 128 |
| Attention module | SAM | Linear-attention | 1/4H × 1/4W × 128 |
| | | Q: 1 × 1, 16, S1 | |
| | | K: 1 × 1, 16, S1 | |
| | | V: 1 × 1, 128, S1 | |
| | | E: Parameter(torch.Tensor(k, 1/4H × 1/4W)) | |
| | | F: Parameter(torch.Tensor(k, 1/4H × 1/4W)) | |
| | | 1 × 1, 64, S1 | 1/4H × 1/4W × 64 |
| | CAM | Self-Attention | 1/4H × 1/4W × 128 |
| | | 1 × 1, 64, S1 | 1/4H × 1/4W × 64 |
| Construction of matching cost | Concat | [conv2_16, conv4_3, SAM, CAM] | 1/4H × 1/4W × 320 |
| | Fusion | 3 × 3, 128, S1 | 1/4H × 1/4W × 32 |
| | | 1 × 1, 32, S1 | |
| | Concat | Left and shifted right | 1/4D × 1/4H × 1/4W × 64 |
| 3D CNN aggregation | Preprocess | | |
| | conv1 | [3 × 3 × 3, 32, S1] × 2 | 1/4D × 1/4H × 1/4W × 32 |
| | conv2 | [3 × 3 × 3, 32, S1] × 2 | 1/4D × 1/4H × 1/4W × 32 |
| | output | ADD[conv1, conv2] | 1/4D × 1/4H × 1/4W × 32 |
| | Hourglass Module 1,2,3 | | |
| | 3Dstack x_1 | 3Dstack1a: 3 × 3 × 3, 64, S2 | 1/8D × 1/8H × 1/8W × 64 |
| | | 3Dstack1b: 3 × 3 × 3, 64, S1 | 1/8D × 1/8H × 1/8W × 64 |
| | 3Dstack x_2 | 3Dstack2a: 3 × 3 × 3, 128, S2 | 1/16D × 1/16H × 1/16W × 128 |
| | | 3Dstack2b: 3 × 3 × 3, 128, S1 | 1/16D × 1/16H × 1/16W × 128 |
| | 3Dstack x_3 | deconv1 | 1/8D × 1/8H × 1/8W × 64 |
| | | shortcut1 | 1/8D × 1/8H × 1/8W × 64 |
| | | ADD[deconv1 | 1/8D × 1/8H × 1/8W × 64 |
| | 3Dstack4 x_4 | deconv2 | 1/4D × 1/4H × 1/4W × 32 |
| | | shortcut2 | 1/4D × 1/4H × 1/4W × 32 |
| | | ADD[deconv2 | 1/4D × 1/4H × 1/4W × 32 |
| Disparity prediction | conv1 | 3 × 3 × 3, 32, S1 | 1/4D × 1/4H × 1/4W × 32 |
| | conv2 | 3 × 3 × 3, 1, S1 | 1/4D × 1/4H × 1/4W × 1 |
| | Upsample | Bilinear interpolation | D × H × W |
| | disparity | Soft Argmin | H × W |
Indicate that ReLU is not included.
Indicate that ReLU and BN are not included, only convolution.
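The last row of the layer table reduces the D × H × W cost volume to an H × W disparity map with a Soft Argmin. A common formulation takes, per pixel, the expected disparity under a softmax of the negated costs. The sketch below shows that formulation for a single pixel; it assumes lower cost means a better match and is not taken from the authors' code.

```python
import math


def soft_argmin(costs):
    """Expected disparity for one pixel, given one matching cost per disparity."""
    m = min(costs)
    # softmax(-cost): the lowest cost receives the largest weight.
    weights = [math.exp(-(c - m)) for c in costs]
    total = sum(weights)
    return sum(d * w / total for d, w in enumerate(weights))


# Costs are symmetric around disparity 2, so the estimate is approximately 2.
estimate = soft_argmin([4.0, 1.0, 0.0, 1.0, 4.0])
print(round(estimate, 6))  # 2.0
```

Unlike a hard argmin, this weighted average is differentiable and can return sub-pixel disparities.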
Figure 2. Linear-Attention architecture.
Figure 3. Linear mapping layers.
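The layer table lists two learned matrices, E and F, of shape k × (1/4H × 1/4W) alongside the Q, K and V projections. One plausible reading is that they project the keys and values from n positions down to k rows, so the attention map is n × k rather than n × n. The sketch below illustrates that reading in pure Python. The 1/sqrt(d) scaling, the helper names, and the toy sizes are assumptions, and the exact formulation in LANet may differ.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def linear_attention(q, k, v, e, f):
    """q, k: n x d; v: n x d_v; e, f: k_dim x n projection matrices."""
    k_proj = matmul(e, k)                            # k_dim x d
    v_proj = matmul(f, v)                            # k_dim x d_v
    scale = 1.0 / math.sqrt(len(q[0]))               # assumed 1/sqrt(d) scaling
    k_proj_t = [list(col) for col in zip(*k_proj)]   # d x k_dim
    scores = matmul(q, k_proj_t)                     # n x k_dim, not n x n
    attention = [softmax([s * scale for s in row]) for row in scores]
    return matmul(attention, v_proj)                 # n x d_v


# Toy sizes for illustration: n = 4 positions, d = 2, d_v = 3, k_dim = 2.
q = [[0.1 * i, 0.2] for i in range(4)]
k = [[0.3, 0.1 * i] for i in range(4)]
v = [[1.0, 2.0, 3.0] for _ in range(4)]
e = [[0.25] * 4 for _ in range(2)]
f = [[0.25] * 4 for _ in range(2)]
out = linear_attention(q, k, v, e, f)
print(len(out), len(out[0]))  # 4 3
```

With k fixed, the score matrix grows linearly in the number of positions n, which matches the O(n) complexity stated in the abstract.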
Figure 4. Visualization of scene flow dataset.
Details of scene flow dataset.
| Dataset | Training | Test | Total | Resolution | Ground truth | Type |
|---|---|---|---|---|---|---|
| Flying things 3D | 21,818 | 4,248 | 26,066 | 960*540 | Dense | Synthetic |
| Driving | 4,392 | - | 4,392 | 960*540 | Dense | Synthetic |
| Monkaa | 8,591 | - | 8,591 | 960*540 | Dense | Synthetic |
Figure 5. Visualization of forest dataset.
Details of forest dataset.
| Tree species | Training | Validation | Test | Total | Resolution | Ground truth | Type |
|---|---|---|---|---|---|---|---|
| Larix gmelinii | 72 | 9 | 9 | 90 | 1,240*426 | Dense | Real |
| Pinus sylvestris var. mongolica | 72 | 9 | 9 | 90 | 1,240*426 | Dense | Real |
| Pinus tabulaeformis var. mukdensis | 64 | 8 | 8 | 80 | 1,240*426 | Dense | Real |
| Fraxinus mandschurica Rupr | 56 | 7 | 7 | 70 | 1,240*426 | Dense | Real |
| Betula platyphylla Suk | 56 | 7 | 7 | 70 | 1,240*426 | Dense | Real |
Ablation experiments on scene flow.
| Model | >1 px (%) | >2 px (%) | >3 px (%) | EPE | Time (s) |
|---|---|---|---|---|---|
| Res_Base | 12.78 | 8.11 | 6.41 | 1.65 | 0.12 |
| Res_CAM_Base | 11.12 | 7.02 | 5.36 | 1.21 | 0.14 |
| Res_SA_Base | 10.24 | 6.48 | 4.91 | 1.03 | 0.24 |
| Res_SAM_k128_Base | 10.47 | 6.65 | 5.04 | 1.10 | 0.16 |
| Res_SAM_k256_Base | 10.38 | 6.58 | 4.98 | 1.07 | 0.17 |
| Res_SAM_k512_Base | 10.29 | 6.52 | 4.93 | 1.05 | 0.18 |
| Res_CAM_SAM_k512_Base | 9.26 | 5.56 | 3.95 | 0.95 | 0.19 |
| Res_CAM_SAM_k512_Hourglass | 7.22 | 3.71 | 2.31 | 0.82 | 0.25 |
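The ablation results are reported as an end-point error (EPE) and as pixel-error rates such as the three-pixel error. As a rough sketch of how such metrics are commonly computed: EPE is the mean absolute disparity error, and the k-pixel error is the percentage of pixels whose error exceeds k. The strict `>` comparison and the function names are assumptions, and the paper's evaluation script may differ in detail.

```python
def end_point_error(pred, gt):
    """Mean absolute disparity error over all valid pixels."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)


def pixel_error_rate(pred, gt, threshold=3.0):
    """Percentage of pixels whose disparity error exceeds `threshold` pixels."""
    bad = sum(1 for p, g in zip(pred, gt) if abs(p - g) > threshold)
    return 100.0 * bad / len(gt)


# Four toy pixels with absolute errors 0, 0.5, 3 and 7.
pred = [10.0, 20.5, 33.0, 47.0]
gt = [10.0, 20.0, 30.0, 40.0]
print(end_point_error(pred, gt))        # 2.625
print(pixel_error_rate(pred, gt, 3.0))  # 25.0
print(pixel_error_rate(pred, gt, 1.0))  # 50.0
```

EPE averages over every pixel, while the pixel-error rate only counts outliers, so the two metrics can rank models differently.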
Comparison experiments on scene flow.
| Method | >3 px (%) | EPE | Params (M) |
|---|---|---|---|
| MC-CNN (Zbontar and LeCun, | 13.70 | 3.79 | - |
| GCNet (Kendall et al., | 9.34 | 2.51 | 3.50 |
| iResNet (Liang et al., | 4.64 | 2.46 | 43.11 |
| DispNet (Mayer et al., | 9.27 | 1.68 | 42.00 |
| CRL (Pang et al., | 6.20 | 1.32 | 78.77 |
| SegStereo (Yang et al., | 4.74 | 1.77 | - |
| EdgeStereo (Song et al., | 4.35 | 1.45 | - |
| PSMNet (Chang and Chen, | 2.43 | 1.09 | 5.20 |
| LANet (Ours) | 2.31 | 0.82 | 4.50 |
Figure 6. Visualization on scene flow. (A) Left images, (B) ground truth, (C) LANet, and (D) PSMNet.
Setting of weighting factors.
| Weight 1 | Weight 2 | Weight 3 | EPE (Forest) | EPE (Scene Flow) |
|---|---|---|---|---|
| 0.0 | 0.1 | 1.0 | 1.25 | 1.53 |
| 0.1 | 0.3 | 1.0 | 1.06 | 1.34 |
| 0.3 | 0.5 | 1.0 | 0.91 | 1.01 |
| 0.5 | 0.7 | 1.0 | 0.68 | 0.82 |
| 0.7 | 0.9 | 1.0 | 0.85 | 0.93 |
| 1.0 | 1.0 | 1.0 | 0.94 | 1.04 |
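The weighting factors above appear to weight the losses of the three stacked hourglass outputs used for intermediate supervision. The sketch below assumes that reading and assumes a smooth L1 loss per output, which is a common choice in stacked-hourglass stereo networks but is not stated in this record. The function names and toy values are hypothetical.

```python
def smooth_l1(pred, gt):
    """Mean smooth L1 loss: quadratic for errors below 1, linear otherwise."""
    total = 0.0
    for p, g in zip(pred, gt):
        diff = abs(p - g)
        total += 0.5 * diff * diff if diff < 1.0 else diff - 0.5
    return total / len(gt)


def weighted_loss(outputs, gt, weights=(0.5, 0.7, 1.0)):
    """Weighted sum of the per-output losses (one output per hourglass)."""
    return sum(w * smooth_l1(o, gt) for w, o in zip(weights, outputs))


gt = [10.0, 20.0]
outputs = [[12.0, 20.0], [11.0, 20.0], [10.5, 20.0]]  # toy predictions
loss = weighted_loss(outputs, gt)
print(round(loss, 4))  # 0.6125
```

The default weights mirror the 0.5, 0.7, 1.0 row, which has the lowest errors in the table; the last output carries the largest weight because it is the one used at inference time in designs of this kind.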
Comparison experiments on Forest.
| Method | D1-all (%) | EPE | Time (s) |
|---|---|---|---|
| MC-CNN (Zbontar and LeCun, | 4.08 | 3.96 | 67.09 |
| GCNet (Kendall et al., | 3.65 | 2.79 | 1.01 |
| iResNet (Liang et al., | 3.58 | 2.73 | 0.20 |
| DispNet (Mayer et al., | 3.08 | 1.96 | 0.14 |
| CRL (Pang et al., | 2.75 | 1.54 | 0.55 |
| SegStereo (Yang et al., | 3.12 | 2.01 | 0.68 |
| EdgeStereo (Song et al., | 2.81 | 1.68 | 0.40 |
| PSMNet (Chang and Chen, | 2.61 | 1.25 | 0.48 |
| LANet (Ours) | 2.15 | 0.68 | 0.35 |
Figure 7. Visualization on forest.