
Depth Reconstruction from Single Images Using a Convolutional Neural Network and a Conditional Random Field Model.

Dan Liu, Xuejun Liu, Yiguang Wu.

Abstract

This paper presents an effective approach for depth reconstruction from a single image through the incorporation of semantic information and local details from the image. A unified framework for depth acquisition is constructed by joining a deep Convolutional Neural Network (CNN) and a continuous pairwise Conditional Random Field (CRF) model. Semantic information and the relative depth trends of local regions inside the image are integrated into the framework. A deep CNN network is first used to automatically learn a hierarchical feature representation of the image. To capture more local details of the image, the relative depth trends of local regions are incorporated into the network. Combined with the semantic information of the image, a continuous pairwise CRF is then established and used as the loss function of the unified model. Experiments on real scenes demonstrate that the proposed approach is effective and obtains satisfactory results.


Keywords:  conditional random field; convolutional neural network; depth reconstruction; single image


Year:  2018        PMID: 29695129      PMCID: PMC5982647          DOI: 10.3390/s18051318

Source DB:  PubMed          Journal:  Sensors (Basel)        ISSN: 1424-8220            Impact factor:   3.576


1. Introduction

Measuring the depth of a scene is a significant research topic in photogrammetry and computer vision, and plays an essential role in applications such as 3D reconstruction, video surveillance and robotics. Much prior work has performed depth acquisition from multiple images taken in accordance with certain requirements [1,2,3], but in fact many photos are not taken for photogrammetric purposes; rather, they are taken by the public or by amateur photographers. Scene structure cannot be correctly recovered through traditional photogrammetry when corresponding features are lacking or when the baseline between the images is too large or too small. Moreover, often only a single image of a scene exists, as with historic photos and images from the Internet. Therefore, depth reconstruction from a single image is a basic task with important research value in photogrammetry and computer vision. The task is an ill-posed and inherently ambiguous problem, as one given image may correspond to an infinite number of possible real-world scenes [4]. Consequently, depth acquisition from a single image is still a challenging issue. Some previous works solve this problem using depth cues such as geometric characteristics [5,6,7,8], shading [9], texture [10] and contour [11]. However, these works only infer the relative depth of the scene from an image and cannot obtain absolute depth. In recent years, many researchers have applied machine learning to the problem and obtained good results [12,13,14,15,16,17]. A common characteristic of these methods is that they rely on hand-crafted features. Saxena et al. [12] extracted three local features from images: haze, texture variations and texture gradient. Shape- and location-based features were added in [13] for better feature representation. However, these low-level features are still not enough to predict the exact depth values of pixels in an image. Based on Saxena et al. [13], Liu et al. 
[14] used semantic labels to guide depth reconstruction from a single image. Another challenging problem for these methods is how to utilize the extracted image features to measure the depth of each pixel in the image. Many of these methods use a Markov Random Field (MRF) to model the relationships between image features and depth. Unfortunately, the MRF is sensitive to multicolored objects in the image and involves many assumptions in making its decisions. Recently, the Convolutional Neural Network (CNN) has become mainstream in image processing research. Compared with the traditional methods applied to depth reconstruction, a CNN can learn high-level representations automatically without any manual intervention. Eigen et al. [18] used a multi-scale deep network to estimate depth maps from a single image. To perform pixel-level depth inference, Hu et al. [19] trained a CNN with raw RGB image patches cropped by a large window centered at each pixel. Li et al. [20] presented a framework for depth estimation from a single image, which consists of depth regression on superpixels via a deep CNN model and refinement from superpixels to pixels via a hierarchical Conditional Random Field (CRF). Similarly, Wang et al. [21] performed depth prediction via regression with a CNN model, combined with a post-processing refinement step using a hierarchical CRF, but they joined depth and semantic inference, considering the two problems mutually beneficial. Unlike the above methods, Liu et al. [22,23] and Xu et al. [24] formulated depth prediction as a continuous CRF learning problem and used a CNN model to learn the feature representation of the image. This approach combines the strengths of the CNN and the CRF in a unified framework. However, they ignored the importance of semantic information for depth reconstruction and did not resolve depth ambiguities of a scene.
In this paper, a unified CNN framework is presented for depth reconstruction from a single image, joining a CNN and a continuous pairwise CRF model. A deep CNN network is first designed to automatically learn a hierarchical feature representation of the image. To capture local details of the image, the relative depth trends of local regions inside the image are integrated into the CNN network. Then, a continuous pairwise CRF is established as the loss function of the unified model using the semantic information of the scene and the outputs of the CNN network from the first step. Depth reconstruction is formulated as a CRF learning problem and can be solved by maximum a posteriori (MAP) inference.

2. Methods

The approach performs pixel-level depth reconstruction from a single image in a unified CNN model framework, shown in Figure 1. The unified model joins a CNN and a continuous pairwise CRF, in which the continuous pairwise CRF is used as the loss function of the CNN. The model architecture consists of three parts: a unary part, a pairwise part and a CRF loss layer. (1) In the unary part, a convolutional network is used to obtain convolutional feature maps from the input image. To get feature maps of the superpixels, the convolutional feature maps are fed into a superpixel pooling layer along with the superpixels inside the image. These feature maps are then followed by three fully-connected layers. (2) In the pairwise part, the semantic information and the similarities of neighboring superpixels are considered and fed into one fully-connected layer to produce the output. (3) In the loss layer, a continuous pairwise CRF, established from the outputs of the unary and pairwise parts, is used as the loss function of the unified CNN framework.
Figure 1

The overall framework of the unified CNN model.
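As an illustration of the superpixel pooling step described above, the following NumPy sketch (the function name and array shapes are our assumptions, not from the paper) average-pools per-pixel convolutional features over each superpixel to produce one feature vector per superpixel:

```python
import numpy as np

def superpixel_pool(feature_maps, superpixel_labels):
    """Average-pool per-pixel feature maps over each superpixel.

    feature_maps: (H, W, C) array of convolutional features.
    superpixel_labels: (H, W) integer array assigning each pixel
        to a superpixel id in 0..n-1 (from over-segmentation).
    Returns an (n, C) array: one pooled feature vector per superpixel.
    """
    h, w, c = feature_maps.shape
    flat_feats = feature_maps.reshape(-1, c)
    flat_labels = superpixel_labels.reshape(-1)
    n = int(flat_labels.max()) + 1
    counts = np.bincount(flat_labels, minlength=n)
    pooled = np.zeros((n, c))
    for ch in range(c):
        # Sum the channel over each superpixel, then divide by pixel counts.
        pooled[:, ch] = np.bincount(flat_labels, weights=flat_feats[:, ch],
                                    minlength=n)
    return pooled / counts[:, None]
```

The pooled vectors would then feed the three fully-connected layers of the unary part.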

2.1. Depth Reconstruction Using CRF Model

Given an image $x$ with corresponding depth labels $y = (y_1, \ldots, y_n)$, where $n$ indexes the superpixels obtained via over-segmentation, the pairwise CRF is modeled as:

$$\Pr(y \mid x) = \frac{1}{Z(x)} \exp\big(-E(y, x)\big), \qquad (1)$$

where the model parameters are learned during training and $Z(x) = \int \exp(-E(y, x))\, dy$ is the normalization term. The energy over the superpixel set $\mathcal{N}$ and the edge set $\mathcal{S}$ takes the following form:

$$E(y, x) = \sum_{p \in \mathcal{N}} U(y_p, x) + \sum_{(p,q) \in \mathcal{S}} V(y_p, y_q, x), \qquad (2)$$

where $U$ and $V$ represent the unary and pairwise potentials, respectively. Once the parameters are learned, the depth map of an image can be predicted by MAP inference:

$$y^* = \arg\max_{y} \Pr(y \mid x). \qquad (3)$$
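Because the unary and pairwise potentials used in this model are quadratic in the depth labels, the MAP inference above reduces to solving a linear system. The following NumPy sketch (a simplification under that assumption; the paper's exact parameterization may differ) finds the minimizer of an energy of the form $\sum_p (y_p - z_p)^2 + \sum_{p<q} w_{pq}(y_p - y_q)^2$:

```python
import numpy as np

def crf_map_depth(z, pair_weights):
    """MAP inference for a quadratic (Gaussian) pairwise CRF over superpixels.

    The energy  sum_p (y_p - z_p)^2 + sum_{p<q} w_pq * (y_p - y_q)^2
    is quadratic in y, so maximizing Pr(y|x) means solving the linear
    system (I + L) y = z, where L = D - W is the graph Laplacian of the
    pairwise weights.

    z: (n,) unary depth regressions for the n superpixels.
    pair_weights: (n, n) symmetric nonnegative weight matrix w_pq.
    """
    n = len(z)
    lap = np.diag(pair_weights.sum(axis=1)) - pair_weights  # graph Laplacian
    return np.linalg.solve(np.eye(n) + lap, z)
```

For example, two neighboring superpixels with regressed depths 0 and 2 and unit pairwise weight are pulled toward each other by the smoothness term.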

2.2. Unary Part

The unary part obtains a depth regression for each superpixel in the image, using a deep CNN model to learn a feature representation of all the superpixels. The unary potential of the CNN model is defined as a Euclidean loss between the ground-truth depth value $y_p$ and the prediction $z_p$:

$$U(y_p, x) = (y_p - z_p)^2. \qquad (4)$$

Usually the depth of a superpixel is represented by a single value. However, this is too coarse, since the depth values of different pixels inside the superpixel may differ. Fortunately, many local regions of the same semantic class have similar structure, which means that their relative depth trends are nearly the same, as shown in Figure 2. Therefore, the relative depth trends of a semantic class can be expressed with a limited set of normalized depth maps called depth templates. The normalized depth map of a superpixel is computed from the depth value at the superpixel center and a scale factor. Given the normalized depth map $\bar{D}_p$, the depth value $c_p$ at the superpixel center and the scale factor $s_p$, the depth map of the superpixel can be defined as:

$$D_p = c_p + s_p \bar{D}_p. \qquad (5)$$
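The template decomposition described above can be sketched as follows: a superpixel's depth patch is reduced to a normalized template plus a center depth and a scale factor, and the absolute depths are recovered by the affine map $D_p = c_p + s_p \bar{D}_p$. The normalization convention used here (dividing by the maximum absolute residual) is an assumption; the paper does not specify it:

```python
import numpy as np

def normalize_patch(depth_patch, center_depth):
    """Express a superpixel's depth patch as a normalized template.

    The patch is reduced to (template, scale) so that
    depth_patch == center_depth + scale * template.
    """
    residual = depth_patch - center_depth
    scale = float(np.abs(residual).max())
    if scale == 0.0:  # perfectly flat patch
        return np.zeros_like(depth_patch), 0.0
    return residual / scale, scale

def reconstruct_patch(template, center_depth, scale):
    """Recover absolute depths from a template, a center depth and a scale."""
    return center_depth + scale * template
```

With this decomposition, the CNN only has to regress the two scalars (center depth and scale) per superpixel while the template carries the relative depth trend.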
Figure 2

Some local regions with similar relative depth trends from the same semantic label. (a,b) Different local regions (superpixels) from the same semantic label (in the red box); (c) the relative depth trends of the local regions in (a,b) are similar.

To obtain the depth templates for each semantic label, the normalized depth maps of all the superpixels with the same semantic label are clustered. In this paper, the relative depth trends of the superpixels, represented by the depth templates, are incorporated into the CNN network. To obtain the absolute depth value of each pixel inside a superpixel, the outputs of the CNN network in the unary part are designed as the depth value at the superpixel center and its normalized scale factor. The structure of the CNN model is similar to that described by Liu et al. [23], but the outputs are different because this paper incorporates the relative depth trends of the superpixels.
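The paper states that the normalized depth maps of each semantic class are clustered into templates but does not detail the algorithm at this point; the following minimal k-means (Lloyd) sketch, with all names, the initialization and the iteration count as assumptions, illustrates the clustering step:

```python
import numpy as np

def depth_templates(norm_maps, k, iters=20, seed=0):
    """Cluster flattened normalized depth maps of one semantic class
    into k depth templates via a few Lloyd (k-means) iterations.

    norm_maps: (m, d) array, one flattened normalized depth map per row.
    Returns the (k, d) cluster centers used as the class's templates.
    """
    rng = np.random.default_rng(seed)
    centers = norm_maps[rng.choice(len(norm_maps), size=k, replace=False)]
    for _ in range(iters):
        # Assign each map to its nearest center, then recompute centers.
        dists = ((norm_maps[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = norm_maps[assign == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers
```

Each resulting center serves as one depth template $\bar{D}$ for that semantic class.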

2.3. Pairwise Part

The pairwise part considers the depth relationships between neighboring superpixels, combining their similarity and semantic information. The pairwise potential of the CRF model is constructed as:

$$V(y_p, y_q, x) = \frac{1}{2}\big(\beta_1 R_{pq} + \beta_2 W_{l_p l_q}\big)(y_p - y_q)^2, \qquad (6)$$

where $\beta_1$ and $\beta_2$ are parameters. The first term represents the consistency information of the neighboring superpixels with their similarity matrix $R$. The similarity features $S_{pq}$ are established from color in LUV space, a color histogram and Local Binary Pattern texture. $R_{pq}$ is produced by one fully-connected layer with weights $\beta$ applied to $S_{pq}$: $R_{pq} = \beta^{T} S_{pq}$. The second term in Equation (6) represents the depth smoothness of the neighboring superpixels given their semantic labels. Here $l_p$ and $l_q$ are the semantic labels of superpixels $p$ and $q$, respectively, and $W_{l_p l_q}$ represents the semantic weight between them: the higher the weight value, the smoother the depth between the neighboring superpixels. A weight matrix $W$ is formed from all the semantic weights; $W$ is an $m \times m$ matrix, where $m$ is the number of semantic labels in the scene, and its entry $W_{ij}$ represents the semantic weight between semantic labels $i$ and $j$.
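A minimal sketch of how the pairwise coefficient for one superpixel pair might be assembled from appearance similarity and the semantic weight matrix. The per-channel similarity transform and the layer shape are assumptions; the paper only states that $R_{pq}$ is produced by one fully-connected layer over similarity features built from LUV color, a color histogram and LBP texture:

```python
import numpy as np

def pairwise_weight(feat_p, feat_q, fc_weights, sem_weight, beta1, beta2):
    """Combine appearance similarity and a semantic weight into the
    coefficient multiplying (y_p - y_q)^2 in the pairwise potential.

    feat_p, feat_q: per-superpixel descriptors (e.g. LUV color, color
        histogram and LBP texture) stacked into one vector each.
    fc_weights: weights of the single fully-connected layer producing R_pq.
    sem_weight: entry W[l_p, l_q] of the semantic weight matrix.
    """
    # Hypothetical per-channel similarity: 1 for identical features,
    # decaying toward 0 as they differ.
    sim_features = np.exp(-np.abs(feat_p - feat_q))
    r_pq = float(fc_weights @ sim_features)  # one FC layer, no bias
    return beta1 * r_pq + beta2 * sem_weight
```

Identical descriptors give the maximum appearance similarity, so neighboring superpixels that look alike (or share a smooth semantic pairing) are encouraged to take similar depths.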

2.4. CRF Loss Layer

The loss function of the depth reconstruction model is the negative log-likelihood of the pairwise CRF in Equation (1). According to Equations (4) and (6), the energy of the CRF can be expressed as:

$$E(y, x) = \sum_{p \in \mathcal{N}} (y_p - z_p)^2 + \sum_{(p,q) \in \mathcal{S}} \frac{1}{2}\big(\beta_1 R_{pq} + \beta_2 W_{l_p l_q}\big)(y_p - y_q)^2. \qquad (7)$$

Equation (1) can then be written as the loss:

$$-\log \Pr(y \mid x) = E(y, x) + \log Z(x). \qquad (8)$$

Here the CNN weights, $\beta_1$, $\beta_2$ and the semantic weight matrix $W$ are the parameters, which can be learned by minimizing Equation (8) over the training data.

3. Results

The proposed method is evaluated on the Make3D dataset [12]. The Make3D dataset contains 534 images of outdoor scenes composed of eight semantic classes: sky, tree, road, water, grass, building, mountain and foreground objects. The method is quantitatively evaluated with several common measures used in prior work [20,23], where $d_i^*$ is the predicted depth at pixel $i$, $d_i$ is the corresponding ground-truth depth, and $N$ is the number of pixels in the image:

mean relative error (Rel): $\frac{1}{N} \sum_{i} \frac{|d_i^* - d_i|}{d_i}$;

root mean squared error (Rmse): $\sqrt{\frac{1}{N} \sum_{i} (d_i^* - d_i)^2}$;

mean log10 error (Log10): $\frac{1}{N} \sum_{i} |\log_{10} d_i^* - \log_{10} d_i|$.

As pointed out in [17], ground-truth depths in Make3D are limited to a range of 0~81 m, due to the limited range and resolution of the sensor. As done in [17], two criteria are used to measure the errors: (1) C1 errors are calculated only over pixels with ground-truth depth less than 70 m; (2) C2 errors are computed over all pixels in the image. To evaluate the quantitative results of the proposed method, several state-of-the-art methods are used for comparison. Additionally, to assess the influence of the constraint information (semantic information, relative depth trends and the CRF) on the results, ablation experiments are performed that share the same model as the proposed approach except for the integrated constraint information.
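The three evaluation measures can be computed directly from the predicted and ground-truth depth maps; a small NumPy sketch (the function name is ours):

```python
import numpy as np

def depth_errors(pred, gt):
    """Compute the three evaluation measures: Rel, Rmse and Log10.

    pred, gt: arrays of predicted and ground-truth depths of the same
    shape, with ground-truth depths strictly positive.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    rel = np.mean(np.abs(pred - gt) / gt)              # mean relative error
    rmse = np.sqrt(np.mean((pred - gt) ** 2))          # root mean squared error
    log10 = np.mean(np.abs(np.log10(pred) - np.log10(gt)))  # mean log10 error
    return rel, rmse, log10
```

For C1 errors, the same function would be applied after masking out pixels whose ground-truth depth is 70 m or more.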

3.1. The Experiments with Different Constraint Information

In the experiments, depth maps are predicted via the CNN model with different constraint information. The results are shown in Table 1, where Unconstrained denotes the model without the semantic information and the relative depth trends of local regions; Semantic_constrained denotes the model integrating only the semantic information; Local_constrained denotes the model integrating only the relative depth trends of local regions; and Eucli_loss denotes the model whose loss function replaces the CRF loss with a Euclidean loss, so that depth reconstruction becomes a regression problem as in much existing work. A qualitative comparison of depth reconstruction with these methods is presented in Figure 3.
Table 1

Errors of depth reconstruction with different constraints.

Methods               | C1 Error                      | C2 Error
                      | Rel    Log10 (m)  Rmse (m)    | Rel    Log10 (m)  Rmse (m)
Eucli_loss            | 0.366  0.137      8.63        | 0.363  0.148      14.41
Unconstrained         | 0.312  0.113      9.10        | 0.305  0.120      13.24
Semantic_constrained  | 0.291  0.109      8.74        | 0.287  0.114      12.10
Local_constrained     | 0.295  0.105      8.53        | 0.291  0.109      11.95
Proposed approach     | 0.260  0.092      7.16        | 0.245  0.103      10.07
Figure 3

Qualitative comparison of depth reconstruction via the proposed approach and Unconstrained. Color indicates depth (red is far, blue is close). (a) Test images; (b) Unconstrained; (c) Proposed approach; (d) Ground-truth.

From the results illustrated in Table 1, the following considerations can be outlined. Semantic_constrained obtains more satisfactory results than Unconstrained, which demonstrates that semantic information is an effective cue for depth reconstruction. Likewise, the relative depth trends of local regions are helpful for depth reconstruction, because the results of Local_constrained outperform those of Unconstrained. The errors of depth reconstruction through Eucli_loss are higher than those of Unconstrained. This is mainly because their loss functions are different: Eucli_loss uses a Euclidean loss as the loss function of the model, whereas Unconstrained uses a pairwise CRF to establish the loss function, which can account for depth consistency and smoothness between neighboring superpixels. As a result of incorporating the semantic information, the relative depth trends and the pairwise CRF into the model, the proposed approach obtains more satisfactory results than the other methods.

3.2. The Experiments with Different Methods

To show the effectiveness of the proposed approach, several state-of-the-art methods are tested for comparison:

Saxena et al. [13]: The method learns the relation between image features and depth values using an MRF. The image features, including haze, texture variations and texture gradient, as well as shape- and location-based features, are manually extracted and represented.

Liu et al. [14]: Building on Saxena et al. [13], this method adds semantic labels to guide depth reconstruction from a single image, but it still depends on hand-crafted features.

Depth transfer [25]: A non-parametric learning method, which avoids explicitly defining a parametric model and requires fewer assumptions than the methods above [13,14]. Likewise, it still depends on hand-crafted features.

DC CRF [17]: Depth prediction is formulated as a discrete-continuous optimization problem, which is solved via particle belief propagation in a graphical model.

DCNF [23]: The method performs depth reconstruction by joining a CNN and a CRF. Unlike the proposed approach, it does not consider semantic information or local detail information from images.

The results of these methods are shown in Table 2. A qualitative comparison of depth reconstruction is presented in Figure 4.
Table 2

Quantitative comparisons with other methods.

Methods               | C1 Error                      | C2 Error
                      | Rel    Log10 (m)  Rmse (m)    | Rel    Log10 (m)  Rmse (m)
Saxena et al. [13]    | -      -          -           | 0.370  0.187      -
Liu et al. [14]       | -      -          -           | 0.379  0.148      -
Depth transfer [25]   | 0.355  0.127      9.20        | 0.361  0.148      15.10
DC CRF [17]           | 0.335  0.137      9.49        | 0.338  0.134      12.60
DCNF [23]             | 0.312  0.113      9.10        | 0.305  0.120      13.24
Proposed approach     | 0.260  0.092      7.16        | 0.245  0.103      10.07
Figure 4

Qualitative comparison of depth reconstruction via the proposed approach and depth transfer [25]. (a) Test images; (b) depth transfer [25]; (c) Proposed approach; (d) Ground-truth.

From the results illustrated in Table 2, the following considerations can be noted. DCNF [23] and the proposed method significantly outperform the other four methods. This is mainly because the other four methods predict depth maps from a single image via hand-crafted features, whereas DCNF [23] and the proposed method use a CNN model that can automatically learn high-level feature representations without any manual intervention. The proposed approach obtains more satisfactory results than DCNF [23] because it integrates semantic information and the relative depth trends of local regions. In addition, depth maps are reconstructed for some images that are not in the Make3D dataset but come from the Internet, which further demonstrates the effectiveness of the proposed approach (Figure 5).
Figure 5

Depth reconstruction for images from the Internet.

4. Discussion

Through the experiments, it is observed that the proposed method performs depth reconstruction from a single image with satisfactory accuracy. The approach uses a unified CNN framework that joins the advantages of the CNN and the continuous pairwise CRF model. On the one hand, it can automatically learn a hierarchical feature representation of the image via the CNN model rather than relying on hand-crafted features. On the other hand, depth reconstruction is formulated as a CRF learning problem rather than a regression problem, because the loss function uses a continuous pairwise CRF instead of a Euclidean loss. In the continuous pairwise CRF, the depth consistency and smoothness of neighboring superpixels are considered. Additionally, the unified framework incorporates semantic information and the relative depth trends of local regions, which helps resolve depth ambiguities and provides more local details in the image. Therefore, depth reconstruction through the proposed approach is effective and offers clear improvements.

5. Conclusions

In this paper, the development and implementation of a new approach for depth reconstruction from a single image is presented. A unified framework joining a CNN and a pairwise CRF model is used to obtain depth information. A particular feature of the approach is that semantic information and the relative depth trends of local regions are integrated into the unified framework. A series of experiments on the Make3D dataset is presented. The experiments with different constraint information demonstrate that the semantic information, the relative depth trends of local regions and the CRF model are all helpful for depth reconstruction from a single image. The experimental results show that the proposed method is effective and suitable for depth reconstruction.
References (5 in total):

1.  Learning Depth from Single Monocular Images Using Deep Convolutional Neural Fields.

Authors:  Fayao Liu; Chunhua Shen; Guosheng Lin; Ian Reid
Journal:  IEEE Trans Pattern Anal Mach Intell       Date:  2015-12-03       Impact factor: 6.226

2.  Depth Transfer: Depth Extraction from Video Using Non-Parametric Sampling.

Authors:  Kevin Karsch; Ce Liu; Sing Bing Kang
Journal:  IEEE Trans Pattern Anal Mach Intell       Date:  2014-11       Impact factor: 6.226

3.  Clustering by passing messages between data points.

Authors:  Brendan J Frey; Delbert Dueck
Journal:  Science       Date:  2007-01-11       Impact factor: 47.728

4.  Consistent depth maps recovery from a video sequence.

Authors:  Guofeng Zhang; Jiaya Jia; Tien-Tsin Wong; Hujun Bao
Journal:  IEEE Trans Pattern Anal Mach Intell       Date:  2009-06       Impact factor: 6.226

5.  A Review of Depth and Normal Fusion Algorithms.

Authors:  Doris Antensteiner; Svorad Štolc; Thomas Pock
Journal:  Sensors (Basel)       Date:  2018-02-01       Impact factor: 3.576

