Xue-Zhi Cui, Quan Feng, Shu-Zhi Wang, Jian-Hua Zhang.
Abstract
To find an economical way to infer the depth of the surroundings of unmanned agricultural vehicles (UAVs), a lightweight depth estimation model called MonoDA, based on a convolutional neural network, is proposed. A series of sequential frames from monocular videos is used to train the model. The model is composed of two subnetworks: a depth estimation subnetwork and a pose estimation subnetwork. The former is a modified version of U-Net with a reduced number of bridges, while the latter takes EfficientNet-B0 as its backbone to extract features from sequential frames and predict the pose transformations between them. A self-supervised strategy is adopted during training, so depth labels for the frames are not needed. Instead, adjacent frames in the image sequence and the reprojection relation given by the pose are used to train the model. The subnetworks' outputs (depth map and pose relation) are used to reconstruct the input frame, and a self-supervised loss between the reconstructed and original inputs is calculated. Finally, this loss updates the parameters of the two subnetworks through the backward pass. Several experiments were conducted to evaluate the model, and the results show that MonoDA achieves competitive accuracy on the KITTI raw dataset as well as on our vineyard dataset. In addition, our method is insensitive to color. On the computing platform of our UAV's environment perception system, an NVIDIA Jetson TX2, the model runs at 18.92 FPS. In summary, our approach provides an economical solution for depth estimation with monocular cameras that achieves a good trade-off between accuracy and speed and can serve as a novel auxiliary depth detection paradigm for UAVs.
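The reconstruction objective described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' code: it assumes a known camera intrinsic matrix `K`, a predicted per-pixel depth, and a predicted relative pose `(R, t)`. Each target pixel is back-projected with its depth, transformed by the pose, and re-projected into the source view; the photometric difference between the warped and original frames serves as the self-supervised loss.

```python
import numpy as np

def backproject_project(u, v, depth, K, R, t):
    """Back-project pixel (u, v) with its predicted depth, apply the
    predicted relative pose (R, t), and re-project into the source view."""
    K_inv = np.linalg.inv(K)
    p_cam = depth * (K_inv @ np.array([u, v, 1.0]))  # 3-D point in target camera
    p_src = R @ p_cam + t                            # same point in source camera
    p_img = K @ p_src                                # homogeneous source pixel
    return p_img[0] / p_img[2], p_img[1] / p_img[2]

def photometric_l1(target, reconstructed):
    """Mean absolute photometric difference used as the self-supervised loss."""
    return np.mean(np.abs(target.astype(float) - reconstructed.astype(float)))

# Toy check: with an identity pose the warp maps every pixel to itself,
# so reconstructing a frame from itself yields zero loss.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0, 0.0, 1.0]])
u2, v2 = backproject_project(10, 20, depth=5.0, K=K, R=np.eye(3), t=np.zeros(3))
img = np.random.rand(48, 64)
loss = photometric_l1(img, img)
```

In the actual model, the warped coordinates index into the source frame (with bilinear sampling) to synthesize the reconstruction, and the loss is back-propagated into both subnetworks.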
Keywords: edge computing device; monocular depth estimation; self-supervised learning; vineyard scene
Year: 2022 PMID: 35161463 PMCID: PMC8838921 DOI: 10.3390/s22030721
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. The structure of MonoDA.
Figure 2. The depth estimation subnetwork.
Figure 3. Partial structures of U-Net and the depth estimation subnetwork. (A) U-Net; (B) depth estimation subnetwork of MonoDA.
Figure 4. The pose estimation subnetwork.
Figure 5. Sample images from the KITTI dataset.
Evaluation criteria of the test items over the vineyard dataset.

| Score | DEE 1 | | | AFPS 2 | HRO 3 | | |
|---|---|---|---|---|---|---|---|
| | EI 4 | AI 5 | ARE 6 | | A1 7 | A2 8 | A3 9 |
| 5 | ≤1.5 | >1.50 | 0~5% | >30.0 | ≤30.0% | ≤1 core full | ≤1.5 GB |
| 4 | 1.5~1.625 | 1.35~1.50 | 5~10% | 24.0~30.0 | 30.0~36.0% | 2 cores full | 1.5~1.8 GB |
| 3 | 1.625~1.750 | 1.20~1.35 | 10~15% | 18.0~24.0 | 36.0~42.0% | 3 cores full | 1.8~2.1 GB |
| 2 | 1.750~1.875 | 1.05~1.20 | 15~20% | 12.0~18.0 | 42.0~48.0% | 4 cores full | 2.1~2.4 GB |
| 1 | 1.875~2.00 | 0.90~1.05 | 20~25% | 6.0~12.0 | 48.0~54.0% | 5 cores full | 2.4~2.7 GB |
| 0 | >2.00 | <0.90 | >25% | <6.0 | >54.0% | 6 cores full | >2.7 GB |
1 DEE, Depth Estimation Effect. 2 AFPS, Average Frames Per Second. 3 HRO, Hardware Resource Occupation. 4 EI, Error Items: the average of Abs Rel, Sq Rel, RMSE and LG RMSE of the model evaluated over the vineyard dataset. 5 AI, Accuracy Items: the average of the accuracy items at the three thresholds (δ < 1.25, δ < 1.25² and δ < 1.25³) of the model evaluated over the vineyard dataset. 6 ARE, Average Relative Error. 7 A1, the average usage of the GPU. 8 A2, the usage of the CPU; “core full” means core occupancy ≥95%. 9 A3, the usage of RAM.
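For reference, the error items (Abs Rel, Sq Rel, RMSE, LG RMSE) and accuracy items (δ < 1.25^k) named above are the standard monocular depth evaluation metrics; a minimal sketch, assuming ground-truth and predicted depths are given as positive arrays:

```python
import numpy as np

def depth_metrics(gt, pred):
    """Standard monocular depth metrics: Abs Rel, Sq Rel, RMSE, log RMSE,
    and the threshold accuracies delta < 1.25^k for k = 1, 2, 3."""
    gt, pred = np.asarray(gt, float), np.asarray(pred, float)
    abs_rel = np.mean(np.abs(gt - pred) / gt)          # mean relative error
    sq_rel = np.mean((gt - pred) ** 2 / gt)            # mean squared relative error
    rmse = np.sqrt(np.mean((gt - pred) ** 2))          # root mean squared error
    log_rmse = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))
    ratio = np.maximum(gt / pred, pred / gt)           # symmetric depth ratio
    a1, a2, a3 = (np.mean(ratio < 1.25 ** k) for k in (1, 2, 3))
    return {"abs_rel": abs_rel, "sq_rel": sq_rel, "rmse": rmse,
            "log_rmse": log_rmse, "a1": a1, "a2": a2, "a3": a3}

# Sanity check: a perfect prediction has zero error and accuracy 1.0.
m = depth_metrics([2.0, 4.0, 8.0], [2.0, 4.0, 8.0])
```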
The evaluation results of several monocular depth estimation models over KITTI.

| Model | Error Items | | | | Accuracy Items | | |
|---|---|---|---|---|---|---|---|
| | Abs Rel | Sq Rel | RMSE | LG RMSE | δ < 1.25 | δ < 1.25² | δ < 1.25³ |
| GeoNet | 0.149 | 1.060 | 5.567 | 0.226 | 0.796 | 0.935 | 0.975 |
| DDVO | 0.151 | 1.257 | 5.583 | 0.228 | 0.810 | 0.936 | 0.974 |
| DF-Net | 0.150 | 1.124 | 5.507 | 0.223 | 0.806 | 0.933 | 0.973 |
| EPC++ | 0.141 | 1.029 | 5.350 | 0.216 | 0.816 | 0.941 | 0.976 |
| Struct2depth | 0.141 | | 5.291 | 0.215 | 0.816 | 0.945 | |
| MonoDepth2 NP 1 | 0.132 | 1.044 | 5.142 | 0.210 | 0.845 | 0.948 | 0.977 |
| MonoDA (ours) | | 1.035 | | | | | |
| MonoDepth2 | | | | | | | |
1 NP, No Pretraining: the model was trained without ImageNet-pretrained weights.
The evaluation results of MonoDepth2 and MonoDA in the vineyard scene.

| Model | Error Items | | | | | Accuracy Items | | | |
|---|---|---|---|---|---|---|---|---|---|
| | Abs Rel | Sq Rel | RMSE | LG RMSE | Average 1 | δ < 1.25 | δ < 1.25² | δ < 1.25³ | Average 1 |
| MonoDA | | | | | | | | | |
| MonoDepth2 | | | | | | | | | |
1 Average is the mean of the sub-item values under the corresponding item.
Results of estimated distances and comparison with the actual distances.

| Model | Distance/m | | | | | | | | | ARE 1 |
|---|---|---|---|---|---|---|---|---|---|---|
| | 1.60 | 2.40 | 3.20 | 4.00 | 4.80 | 6.40 | 7.20 | 8.00 | 8.80 | |
| MonoDepth2 | | | | | | | | | | |
| MonoDA | | | | | | | | | | |
1 ARE, Average Relative Error: the average of the relative errors at the 9 positions.
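The ARE in this table follows the footnote's definition; a minimal sketch with hypothetical readings (not the paper's measurements):

```python
def average_relative_error(actual, estimated):
    """ARE: mean of |estimated - actual| / actual over the test positions."""
    errs = [abs(e - a) / a for a, e in zip(actual, estimated)]
    return sum(errs) / len(errs)

# Hypothetical distance readings at three of the marked positions,
# for illustration only: relative errors 5%, 0% and 5%, so ARE = 10%/3.
are = average_relative_error([1.60, 3.20, 6.40], [1.68, 3.20, 6.08])
```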
Results of real-time depth estimation on TX2.
| Model | Detection of Real-Time Video | | | | Comprehensive Evaluation |
|---|---|---|---|---|---|
| | Average Frame Rate | A1 1 | A2 2 | A3 3 | |
| MonoDepth2 | 16.84 | 51.92% | c1: 4–20%, c2: 24–78% | 1.8 GB | Medium speed |
| MonoDA | 18.92 | 45.42% | c1: 6–27%, c2: 26–78%, c3: 32–82%, c5: 4–25% | 1.8 GB | Relatively smooth |
1 A1 is the average usage of the GPU. 2 A2 is the usage of the CPU; ci (i = 1, …, 6) denotes the 6 cores of the CPU. 3 A3 is the usage of RAM.
The scores of the two models.
| Model | Items Score | | | | | | | Total Score |
|---|---|---|---|---|---|---|---|---|
| | EI 1 | AI 2 | ARE 3 | AFPS | A1 | A2 | A3 | |
| MonoDepth2 | 3 | 3 | 2 | 2 | 1 | 5 | 4 | 2.4 |
| MonoDA | 3 | 3 | 3 | 3 | 2 | 5 | 4 | 3.1 |
1 EI, Error Items: the average of Abs Rel, Sq Rel, RMSE and LG RMSE of the model evaluated over the vineyard dataset. 2 AI, Accuracy Items: the average of the accuracy items at the three thresholds (δ < 1.25, δ < 1.25² and δ < 1.25³) of the model evaluated over the vineyard dataset. 3 ARE, Average Relative Error. The number of sub-items under each item is used in computing the total score. A1, A2 and A3 are the usages of the GPU, CPU and RAM, respectively.
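The AFPS item score can be read off the banding in the evaluation-criteria table; the lookup below is an illustration of that banding, not the authors' code:

```python
def afps_score(fps):
    """Score the AFPS item per the vineyard evaluation criteria:
    >30 -> 5, 24~30 -> 4, 18~24 -> 3, 12~18 -> 2, 6~12 -> 1, <6 -> 0."""
    bands = [(30.0, 5), (24.0, 4), (18.0, 3), (12.0, 2), (6.0, 1)]
    for lower_bound, score in bands:
        if fps > lower_bound:
            return score
    return 0
```

With the frame rates reported on the TX2, MonoDA (18.92 FPS) falls in the 18.0~24.0 band (score 3) and MonoDepth2 (16.84 FPS) in the 12.0~18.0 band (score 2), consistent with the reported item scores.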
Figure 6. Example depth maps inferred by the models from a KITTI image.
Figure 7. An example depth map of the vineyard scene inferred by the models.