| Literature DB >> 34084931 |
Ghada M Fathy1,2, Hanan A Hassan1, Walaa Sheta1, Fatma A Omara2,3, Emad Nabil2,4.
Abstract
Occlusion awareness is one of the most challenging problems in several fields such as multimedia, remote sensing, computer vision, and computer graphics. Realistic interaction applications struggle with occlusion and collision problems in dynamic environments. Dense 3D reconstruction is the best solution to this issue; however, such methods perform poorly in practical applications due to the absence of accurate depth, camera pose, and object motion. This paper proposes a new framework that builds a full 3D model reconstruction that overcomes the occlusion problem in a complex dynamic scene without using sensor data. Popular devices such as a monocular camera are used to generate a model suitable for video streaming applications. The main objective is to create a smooth and accurate 3D point cloud for a dynamic environment using the cumulative information of a sequence of RGB video frames. The framework is composed of two main phases. The first uses an unsupervised learning technique to predict scene depth, camera pose, and object motion from monocular RGB videos. The second generates a frame-wise point cloud fusion that reconstructs a 3D model from the video frame sequence. Several evaluation metrics are measured: localization error, RMSE, and fitness between the ground truth (KITTI's sparse LiDAR points) and the predicted point cloud. Moreover, we compared the framework with widely used state-of-the-art evaluation methods such as MRE and Chamfer Distance. Experimental results showed that the proposed framework surpassed the other methods and proved to be a powerful candidate for 3D model reconstruction. ©2021 Fathy et al.
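The second phase's frame-wise fusion rests on back-projecting each predicted depth map through the predicted camera pose into a world-frame point cloud. A minimal sketch of that back-projection step, assuming a pinhole intrinsics matrix `K` and a 4x4 camera-to-world pose; the function name and defaults are illustrative, not taken from the paper:

```python
import numpy as np

def backproject_depth(depth, K, pose=np.eye(4)):
    """Back-project a depth map into a 3D point cloud in world coordinates.

    depth : (H, W) per-pixel depth (e.g. from an unsupervised depth network)
    K     : 3x3 pinhole camera intrinsics
    pose  : 4x4 camera-to-world transform (the predicted camera pose)
    Returns an (H*W, 3) array of 3D points.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)  # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T            # unit-depth rays in the camera frame
    pts_cam = rays * depth.reshape(-1, 1)      # scale each ray by its depth
    pts_h = np.hstack([pts_cam, np.ones((pts_cam.shape[0], 1))])
    return (pts_h @ pose.T)[:, :3]             # transform into the world frame
```

Fusing a video sequence then amounts to concatenating (and filtering) the per-frame clouds produced by this step.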
Keywords: 3D Model Reconstruction; Dynamic scenes; Occlusion problem; Point cloud; Unsupervised learning
Year: 2021 PMID: 34084931 PMCID: PMC8157153 DOI: 10.7717/peerj-cs.529
Source DB: PubMed Journal: PeerJ Comput Sci ISSN: 2376-5992
Figure 1The proposed framework of 3D model reconstruction from monocular KITTI video images (Geiger, 2013).
The KITTI dataset is released under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License: http://www.cvlibs.net/datasets/kitti/.
The main characteristics of the most relevant state-of-the-art methods.
| Published | Single/Multiple frame | Single/Multiple object | Static/Dynamic object | Input type | Methods |
|---|---|---|---|---|---|
| | Single frame | Single object | Static object | RGB-D sensor | Hyper neural network |
| | Single frame | Single object | Static object | 3D models | GAN neural network |
| | Multiple (2 image sequences) | Single object | Static object | Monocular endoscope | Structure from motion (SfM) |
| | Single frame | Single object | Static object | RGB-D sensor | Monocular SLAM |
| | Single frame | Multiple (full scene) | Static scene (removes dynamic objects) | Monocular RGB | Online incremental mesh generation |
| | Single frame | Single object | Dynamic object | Monocular RGB | Markerless 3D human motion capture |
| | Single frame | Single object | Dynamic object | Monocular RGB | GCN network |
| | Single frame | Cropped single object | Dynamic object | Monocular RGB | Geometric priors, shape reconstruction, and depth prediction |
| | Multiple (two consecutive point clouds) | Multiple (full scene) | Dynamic objects | Outdoor LiDAR datasets | LSTM and GRU networks |
| | Single frame | Multiple (full scene) | Dynamic objects | Outdoor LiDAR datasets | Predict next scene using LSTM |
| | Single frame | Multiple objects | Dynamic objects | Monocular RGB | Structure from motion |
| | Multiple frames | Single object | Dynamic object | Monocular RGB | Non-rigid structure from motion (NRSfM) |
| | Multiple frames (two consecutive) | Multiple (full scene) | Dynamic object | Monocular RGB | Segments the optical flow field into a set of motion models |
| | Multiple (2 frames) | Multiple (full scene) | Dynamic objects | Monocular RGB | Superpixel over-segmentation |
| Proposed framework | Multiple (whole video frame sequence) | Multiple (full scene) | Dynamic objects | Monocular RGB | Unsupervised learning and point cloud fusion |
Figure 2. The proposed framework overview.
Figure 3. The pseudocode of the 3D model reconstruction process.
Figure 4. Localization error, FNE, and FPE with different r values.
Figure 5. Registration between ground truth (yellow) and predicted 3D point cloud (blue).
Figure 6. Average RMSE for 20 frames.
Figure 7. Average fitness for 20 frames.
Figure 8. ICP-RMSE for 20 frames.
Figure 9. 3D point cloud mapped to a 2D KITTI image (Geiger, 2013).
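Figures 6–8 report fitness and RMSE between the predicted cloud and the LiDAR ground truth under a correspondence radius r (the r swept in Fig. 4). A hedged NumPy sketch of how such registration metrics are commonly computed; the brute-force nearest-neighbour search, the function name `registration_metrics`, and the default radius are illustrative, not the paper's exact implementation (which evaluates after ICP alignment):

```python
import numpy as np

def registration_metrics(pred, gt, r=0.5):
    """Fitness and RMSE between a predicted cloud (N,3) and ground truth (M,3).

    fitness: fraction of predicted points with a ground-truth neighbour
             within radius r; rmse: root-mean-square of those inlier distances.
    Brute-force O(N*M); a KD-tree would replace the pairwise matrix at scale.
    """
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1).min(axis=1)
    inliers = d <= r
    fitness = float(inliers.mean())
    rmse = float(np.sqrt(np.mean(d[inliers] ** 2))) if inliers.any() else float("inf")
    return fitness, rmse
```

Higher fitness means more of the predicted cloud finds ground-truth support; lower RMSE means the supported points sit closer to the LiDAR measurements.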
(A) Selected input frame; (B) ground truth; (C) predicted points. The KITTI dataset is under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License. http://www.cvlibs.net/datasets/kitti/.
Figure 10. MRE for 3D reconstruction using different techniques on the KITTI dataset.
The improvement percentages in MRE between the proposed framework and the state-of-the-art.
| Approach | Improvement percentage (%) |
|---|---|
| BMM | 75.03 |
| PTA | 83.58 |
| GBLR | 82.71 |
| DT | 83.50 |
| DMDE | 54.39 |
| DJP | 46.77 |
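MRE here is the mean relative error between predicted and ground-truth values: the absolute error normalized by the ground truth, averaged over valid points. A common formulation as a sketch; the masking of near-zero ground truth via `eps` is an assumption for numerical safety, not taken from the paper:

```python
import numpy as np

def mean_relative_error(pred, gt, eps=1e-8):
    """Mean Relative Error: mean of |pred - gt| / gt over valid ground truth."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    mask = gt > eps                     # evaluate only where ground truth exists
    return float(np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask]))
```

The table's improvement percentage is then simply (MRE_baseline - MRE_ours) / MRE_baseline × 100.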
Comparison using Chamfer Distance between the proposed framework and the state-of-the-art.
| Model | Chamfer distance (KITTI) |
|---|---|
| MoNet (LSTM) | 0.573 |
| MoNet (GRU) | 0.554 |
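Chamfer Distance sums, for each cloud, the mean distance from its points to their nearest neighbours in the other cloud, so it penalizes both missing and spurious geometry. A small brute-force sketch; conventions vary (some papers use squared distances), so the Euclidean form below is an assumption:

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between point clouds a (N,3) and b (M,3).

    Brute-force O(N*M) pairwise distances; fine for small clouds, while
    KITTI-scale evaluation would use a KD-tree for nearest neighbours.
    """
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # pairwise distances
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())   # both directions
```

Lower values indicate the predicted and ground-truth clouds cover each other more tightly.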