Literature DB >> 36236485

Lightweight Depth Completion Network with Local Similarity-Preserving Knowledge Distillation.

Yongseop Jeong¹, Jinsun Park², Donghyeon Cho³, Yoonjin Hwang⁴, Seibum B Choi⁴, In So Kweon⁵.

Abstract

Depth perception capability is one of the essential requirements for various autonomous driving platforms. However, accurate depth estimation in a real-world setting is still a challenging problem due to high computational costs. In this paper, we propose a lightweight depth completion network for depth perception in real-world environments. To effectively transfer a teacher's knowledge, useful for the depth completion, we introduce local similarity-preserving knowledge distillation (LSPKD), which allows similarities between local neighbors to be transferred during the distillation. With our LSPKD, a lightweight student network is precisely guided by a heavy teacher network, regardless of the density of the ground-truth data. Experimental results demonstrate that our method is effective to reduce computational costs during both training and inference stages while achieving superior performance over other lightweight networks.

Entities: Chemical

Keywords: depth completion; knowledge distillation; local similarity; model compression; multimodal learning; sensor fusion

Mesh：
Algorithms
Humans

Year: 2022 PMID： 36236485 PMCID： PMC9573132 DOI： 10.3390/s22197388

Source DB: PubMed Journal: Sensors (Basel) ISSN： 1424-8220 Impact factor: 3.847

1. Introduction

Recent advances in autonomous driving technologies have realized commercial self-driving platforms operating in dynamic real-world environments [1,2]. These real-world systems often benefit from various sensors, such as color cameras, radars, LiDARs, ultrasonic sensors, and thermal cameras, for robust perception in changing environments [3,4,5]. However, the computational cost typically increases with the increasing number of sensors. This problem is critical to commercial platforms because these systems strictly require real-time performance for reliable and robust operation in real-world environments. To ensure real-time performance, existing systems utilize high-cost custom processing units or lightweight perception agents with reduced computational costs but limited performance [6,7]. Among them, robust depth perception is one of the most important tasks for autonomous platforms. A LiDAR is the most popular sensor for accurate depth perception in both indoor and outdoor environments. It provides highly accurate depth measurements from near to far distances; however, it only collects sparse depth values of a scene due to its mechanical and structural limitations. To overcome this limitation, various depth completion algorithms are proposed to combine RGB and LiDAR data because of their complementary characteristics. Ma and Karaman [8] proposed a simple encoder–decoder network for dense depth estimation. A 4-channel image containing RGB and sparse depth is fed into their network for depth estimation. Moreover, spatial propagation algorithms utilizing local and non-local neighbors are proposed to benefit from relevant local information around sparse depth measurement. Cheng et al. [9] presented a convolutional spatial propagation network (CSPN) for depth completion. The CSPN predicts an initial dense depth and it is iteratively refined by a spatial propagation process with local 8-neighbor pixels. Park et al. [10] proposed a non-local spatial propagation network (NLSPN), which utilizes pixel-wise non-local neighbors during the propagation. Unfortunately, the aforementioned algorithms rely on heavy networks that do not ensure real-time performance. To overcome this limitation, lightweight networks for depth completion tasks were proposed. Tao et al. introduced lightweight depth completion with a Sobel edge prediction network [11] and self-attention-based multi-level feature integration and extraction [12]. Although these approaches contribute to decreasing the computational cost by effectively reducing the parameter size and model complexity, they cannot leverage or surpass the better performance of existing networks. Recently, various knowledge distillation (KD) methods have been proposed to consider the balance between high performance and computational costs. They aim to maintain the robust performance of heavy networks while reducing computational costs and network sizes based on the concept of teacher and student networks. For instance, a heavy teacher network is trained with large-scale datasets, and then a lightweight student network is trained with both large-scale (or small-scale) datasets and precise guidance from the teacher network. With the KD, the lightweight student can achieve better performance compared to the student trained without guidance from the teacher. Therefore, various KD methods have been proposed for numerous low- to high-level perception tasks recently. Xu et al. [13] proposed logit, feature, and structure distillations for human pose estimation. Liu et al. [14] adopted KD for video-based egocentric activity recognition. Yoon et al. [15] proposed spatial- and channel-wise similarity-preserving KD for image matting problems. Yang et al. [16] proposed a cross-image relation KD for semantic segmentation problems. However, typical KD methods require large computational resources during the distillation. Therefore, distillation is often conducted with high-level features requiring small computing resources, although distillation on low-level features is proven to be more effective [15]. In order to benefit from lightweight network architectures with low- to high-level distillation, in this paper, we propose local similarity-preserving knowledge distillation (LSPKD) for depth completion. Previous KD methods [15,17] have demonstrated that the intra-similarity of features can accurately guide student networks during the distillation. However, they utilize global similarity, consuming large computing resources, while local information is more beneficial in various depth completion methods [9,10]. Based on this observation, we propose to focus on local similarity preservation for reduced computational costs during both distillation and inference. With our LSPKD, a lightweight student network achieves superior performance compared to those trained with conventional distillation methods or without distillations.

2. Method

In this section, we first describe the baseline teacher and student architectures for the depth completion. Afterwards, the proposed local similarity-preserving KD is presented.

2.1. Problem Formulation

A dense depth map D can be predicted from network g with a sparse depth map with parameter [18,19]. Due to the sparse nature of typical LiDAR point clouds, it is important to combine local information from the paired color image around these points for accurate dense depth estimation. If a corresponding RGB image I whose pixels are aligned with is utilized as a guide for input sparse depth, (1) can be formulated by The parameter can be optimized to train the network by minimizing loss function with given ground-truth depth . The learning problem is to determine with effectively designed loss function . Predicted depth maps are evaluated based on metrics such as RMSE, MAE, iRMSE, and iMAE [3] to estimate performance. Moreover, the size of parameter mainly affects the computational cost.

2.2. Network Architecture

Various methods have adopted the convolutional neural network [20] and encoder–decoder network architecture with skip connections [8,9,10,21,22] to solve depth completion problems. In this work, we utilize a ResNet34-based network [23] with skip connections as our teacher network for fair comparison. The teacher network comprises two encoders for RGB and LiDAR and one decoder to fuse multi-modal high-level features. Each encoder has an input convolutional layer, 16 successive basic residual blocks [23], and the last convolutional layer. High-level features extracted from encoders are concatenated to be fed into the decoder that consists of 6 deconvolutional layers. The output feature of each decoder layer is concatenated with corresponding RGB and LiDAR encoder features by skip connections, and then fed into the next decoder layer. Figure 1 shows the overall architecture of our baseline teacher network.

Figure 1

An overall pipeline of the proposed algorithm. The ResNet34-based teacher network consists of two separate encoders for RGB and LiDAR and a decoder for depth prediction. Output feature dimensionalities of each layer are shown together. Encoder features from RGB and LiDAR are concatenated and fed into the decoder. Skip connections deliver encoder features to decoder layers by concatenation. The ResNet18-based student network is distilled with the knowledge from the teacher network.

For the student network, we halve the number of basic blocks of the encoders (i.e., ResNet18 [23]) and reduce the number of channels in all the layers of the encoders and decoder. Exact parameter comparisons will be provided for each experimental result separately.

2.3. Local Similarity-Preserving Knowledge Distillation

Hinton et al. have shown that it it possible to transfer knowledge from a large model into a smaller, distilled model and demonstrated that the knowledge distillation (KD) method is applicable for not only image classification but also commercial acoustic model systems [24]. Similarity-preserving KD algorithms [15,17] have demonstrated their effectiveness in various applications, such as classification and image matting. These tasks are suited to exploiting inter-image similarity [17] or global intra-image similarity [15]. However, many depth completion works [9,10] make use of local and non-local information around depth measurements rather than the global information across the entire image due to the geometric nature of natural scenes. In other words, a local area in a scene typically has continuous depth values, except for object boundaries. Moreover, measuring global similarity across the entire image consumes a huge amount of GPU memory during the distillation process [15]. Therefore, conventional methods usually search for a subset of layers of the network to distill due to the limited computational resources. With this observation, we propose a local similarity-preserving KD to effectively utilize the similarity information of low-level features without huge memory requirements during the distillation process. We first calculate the local similarity of a reference feature to its neighbors as follows: where f denotes the normalized feature, x and y are the reference pixel coordinates, j is the index of the neighbors, and and are pixel offsets of the j-th neighbor from the reference, respectively. We adopt the conventional 8-neighbor configuration for the distillation as follows: Note that given a feature map , the local similarity S is calculated for each pixel and then we construct , regardless of the channel dimensionality C, where H, W, and N are the height, width, and the number of local neighbors, respectively. Based on the local similarity S calculated from paired teacher and student layers, the proposed LSPKD loss is defined as follows: where is a weight parameter and t and s indicate that F and S come from the teacher and student networks, respectively. is a dimensionality matching function between teacher and student features in case their channel numbers are different. We adopt a 1 × 1 convolutional layer as for efficiency. The proposed consists of two components. The first term enforces pixel-level feature similarity (with auxiliary dimensionality matching) to directly distill features extracted from the deep network. This direct distillation is simple but effective in transferring valuable knowledge from the teacher to the student [25]. The second term further improves the student by enforcing it to preserve the local similarity of the teacher network. Note that the local similarity is closely related to the affinity, which is proven to be highly effective in densifying predictions for various applications [10,26,27].

2.4. Training Lightweight Depth Completion Network

To train the lightweight student network, we utilize both the dense depth prediction from the teacher and the ground truth (GT). Let , , and be the GT and predictions from the teacher and student networks, respectively. The student prediction can be supervised with and as follows: where loss is adopted for better depth boundary predictions. The final loss function is defined as follows: where and are user parameters.

3. Experiments

In this section, we describe the implementation details of the proposed LSPKD. Then, we present quantitative and qualitative evaluations on two public depth completion benchmark datasets [3,28], as well as in-depth analyses. Moreover, we present the impact of layer selection for knowledge distillation by providing a comparison of performance among the results of various layer combinations. Robustness to the sparsity of the supervision signal is presented to verify the effectiveness of our algorithm.

3.1. Implementation Details

Our algorithm is implemented using the PyTorch framework [29] on a machine equipped with two NVIDIA V100 GPUs. For the training, the ADAM optimizer is used with the initial learning rate 0.001, , and . For all the experiments, we set . We follow conventional depth completion works [8,9,10] and adopt RMSE (mm), MAE (mm), iRMSE (1/km), iMAE (1/km), REL, and for our evaluation metrics. More detailed configurations will be described for each dataset in the following sections. For the distillation, we adopt probabilistic knowledge transfer (PROB) [30] and attention transfer (ATT) [31] for comparisons. These methods are adopted because they introduce small additional computational burdens during the distillation. Implementation details for layer combinations for the distillation will be explained in Section 3.4 in detail.

3.2. KITTI Depth Completion

The KITTI Depth Completion (KITTI DC) dataset [32] provides approximately 86K RGB and LiDAR depth images for the training and 7K images for the validation, respectively. The teacher and student networks are trained for 20 epochs with 8 and 16 batch sizes, respectively. For the student network, we halved the number of channels in all the layers and set . As a result, the student network has approxiately 16.53% parameters compared to those of the teacher network. Table 1 shows quantitative evaluation results on the KITTI DC validation set, as well as the number of parameters and FLOPs. We adopted Self S2D [33] for comparison because it has the same baseline architecture. Note that our teacher network has more parameters because of the individual encoders for the RGB and LiDAR branches. However, due to the progressive downsampling of features, our network requires fewer computational operations. As reported in Table 1, our teacher network shows better performance compared to Self S2D. The small student network trained from scratch shows poor performance, as expected. However, with various distillations, including PROB [30] and ATT [31], the small network achieves a substantial performance improvement. Furthermore, the proposed LSPKD outperforms both PROB and ATT. In addition, LSPKD can be seamlessly combined with PROB and ATT to further improve the performance. We argue that the reason for the superiority of the LSPKD is that the local information is highly important in depth completion tasks. Figure 2 shows qualitative comparisons on the KITTI DC dataset. Compared to the other methods, our method successfully preserves fine depth structures for dense prediction.

Table 1

Quantitative evaluation results on the KITTI DC validation dataset [32] (T: Teacher, S: Student, D: Distilled).

Network	# Params (M)/GFLOPs (912 × 220)	Distillation	Metrics
Network	# Params (M)/GFLOPs (912 × 220)	Distillation	RMSE	MAE	iRMSE	iMAE
Self S2D [33]	26.11/637.89	-	878.6	260.9	3.3	1.3
ResNet34 (T)	51.77/349.36	-	865.2	222.1	2.4	1.0
ResNet18 (S)	8.56/59.22	-	921.5	233.3	2.7	1.0
ResNet18 (D)	8.56/59.22	PROB [30]	902.6	243.3	8.5	1.1
		ATT [31]	907.6	245.0	2.7	1.1
		Ours	893.0	234.9	2.8	1.0
		Ours + PROB	893.7	238.6	2.6	1.0
		Ours + ATT	893.3	243.5	2.6	1.0
		Ours + PROB + ATT	891.8	238.6	2.7	1.0

Figure 2

Depth prediction results on the KITTI DC validation dataset [32]. (a) RGB. (b) Sparse Depth. (c) GT. (d) Teacher. (e) Student. (f) PROB [30]. (g) ATT [31]. (h) Ours.

3.3. NYU Depth V2

The NYU Depth V2 (NYUv2) dataset [28] consists of approximately 50K RGB and depth images for the training and 1.5K images for the evaluation, respectively. The teacher and student networks are trained for 15 epochs with a batch size of 32, similarly to the KITTI DC dataset configuration. For the student network, the number of channels in all the layers is reduced to 1/8 (i.e., 1.30% parameters) and is set to 0.1. Table 2 provides quantitative evaluations on the NYUv2 validation set. Due to the significantly reduced number of parameters, PROB [30] failed to improve the student network (i.e., worse performance than the student trained from scratch). Contrarily, the proposed LSPKD successfully distilled the student network and outperformed the naive student network. Different from the KITTI DC case, combining conventional algorithms does not always lead to improved performance in the NYUv2. Therefore, we conclude that our LSPKD is sufficient for highly lightweight network distillation.

Table 2

Quantitative evaluation results on the NYUv2 validation dataset [28] (T: Teacher, S: Student, D: Distilled).

Network	# Params (M)/GFLOPs (304 × 228)	Distillation	Metrics
Network	# Params (M)/GFLOPs (304 × 228)	Distillation	RMSE	REL	δ1.25	δ1.252	δ1.253
S2D + SPN [8,27]	31.88/24.53	-	172.0	0.0310	0.9710	0.9940	0.9980
DeepLiDAR [34]	143.98/502.12	-	115.0	0.0220	0.9930	0.9990	1.0000
ResNet34 (T)	51.77/112.32	-	114.4	0.0184	0.9932	0.9989	0.9998
ResNet18 (S)	0.66/1.46	-	152.1	0.0282	0.9875	0.9978	0.9995
ResNet18 (D)	0.66/1.46	PROB [30]	154.8	0.0328	0.9891	0.9982	0.9996
		ATT [31]	149.9	0.0302	0.9891	0.9982	0.9996
		Ours	138.8	0.0248	0.9899	0.9984	0.9997
		Ours + PROB	138.9	0.0249	0.9900	0.9984	0.9997
		Ours + ATT	138.7	0.0248	0.9899	0.9984	0.9997
		Ours + PROB + ATT	143.6	0.0268	0.9899	0.9984	0.9997

3.4. Ablation Studies

In this subsection, we provide analyses of the impact of layer selection for distillation and robustness to the sparsity of the supervision signal to verify the effectiveness of our algorithm.

3.4.1. Layer Selection for Distillation

The effectiveness of the distillation on each layer of a deep network can vary drastically depending on the network architecture or target tasks. Table 3 shows performance comparison results with various combinations of layers for the distillation. Overall, the distillation performance is poor when using only the layers in the encoder. Moreover, the performance is degraded when using only the high-level feature layers of the encoder and decoder (i.e., and in Figure 1). In contrast, mid-level layers (i.e., and in Figure 1) have shown a substantial performance improvement when used for the distillation. We presume that the similarities of very low-level or very high-level layers provide limited local or overly wide-range information that is not suitable for depth completion. Thus, we have adopted and for the distillation for all experiments.

Table 3

Performance comparison with various combinations of layers for the distillation on the KITTI DC validation dataset [32].

Encoder					Decoder					Metrics
E0	E1	E2	E3	E4	D0	D1	D2	D3	D4	RMSE	MAE	iRMSE	iMAE
-	-	✓	✓	✓	-	-	-	-	-	899.3	241.8	2.7	1.1
-	-	-	-	-	✓	✓	✓	-	-	897.1	239.9	2.9	1.0
-	✓	✓	✓	-	-	-	-	-	-	894.4	235.5	2.5	1.0
-	-	-	-	-	-	✓	✓	✓	-	896.4	237.9	2.6	1.0
✓	✓	✓	-	-	-	-	-	-	-	901.7	239.4	2.8	1.1
-	-	-	-	-	-	-	✓	✓	✓	899.8	242.0	2.6	1.0
✓	✓	✓	✓	-	-	-	-	-	-	898.1	237.2	2.6	1.0
-	-	-	-	-	-	✓	✓	✓	✓	894.1	238.3	2.6	1.0
-	-	✓	✓	✓	✓	✓	✓	-	-	902.8	236.7	2.7	1.0
-	✓	✓	✓	-	-	✓	✓	✓	-	893.0	234.9	2.8	1.0
✓	✓	✓	-	-	-	-	✓	✓	✓	898.7	236.8	2.6	1.0
✓	✓	✓	✓	-	-	✓	✓	✓	✓	894.9	235.7	2.6	1.0

3.4.2. Sparsity of Supervision

The KITTI DC dataset provides semi-dense ground-truth depth data for the training by accumulating a number of successive frames to the reference frame with outlier filtering. The density (i.e., precision) of the GT can vary depending on how many frames are accumulated. Therefore, this level of GT density is often not available in various real-world scenarios. In the extreme case, there may be only one frame to produce the GT depth data, in which case only very sparse depth data (e.g., exactly the same as the input LiDARs) are available. Therefore, we validate the effectiveness of our method with highly sparse supervision signals (i.e., self-supervision with input LiDARs). We trained the student network with very sparse depth data instead of GT ones. Note that the teacher network is trained by GT and its parameters are fixed during the distillation. Each distillation method achieved the following RMSE: {Naive student: 16140.7, PROB [30]: 1185.4, ATT [31]: 1197.3, Ours: 1179.0}. Note that the density of sparse supervision decreases to 9.1% of the semi-dense GT; therefore, the naive student failed to converge and the overall performance is decreased for all methods. However, our method still achieves the best performance compared to the others. This result empirically demonstrates that our LSPKD is robust to the density of supervision signals thanks to the local similarities.

3.4.3. Comparison to Global Similarity-Preserving KD

We compare the proposed LSPKD with a global similarity-preserving KD method (i.e., SPKD [15]) to validate the efficiency and effectiveness of our method. Because the SPKD requires a huge amount of memory to distill low- and mid-level features, we have distilled and for comparison with the batch size 12, and we obtained the following RMSE and GPU memory consumption for the training per image: {SPKD: 901.6/7.2 GB, LSPKD: 903.6/1.70 GB, LSPKD (Mid-level): 893.0/1.71 GB}. Note that our method shows comparable performance to the SPKD, and outperforms it with the mid-level feature distillation. Low- or mid-level distillations are possible only for our LSPKD because the GPU memory requirement is significantly smaller compared to that of the original SPKD. Therefore, we conclude that our method is suitable for distilling low- or mid-level features without enormous GPU memory requirements for both training and inference for efficiency and performance improvement.

4. Conclusions

In this paper, we have proposed a lightweight depth completion network with local similarity-preserving knowledge distillation. A lightweight depth completion network is effectively trained by the proposed distillation algorithm, with low computational costs for both training and inference stages. The trained network maintains performance comparable to that of previous depth completion networks and superior to the performance of a student network without distillation. Additionally, the experimental result shows that our LSPKD outperforms previous distillation algorithms in both indoor and outdoor datasets. Moreover, the proposed method is verified to be robust to the density level of the supervision signals. For future works, various similarity metrics can be considered for the local similarity estimation.

2 in total

1. A closed-form solution to natural image matting.

Authors: Anat Levin; Dani Lischinski; Yair Weiss
Journal: IEEE Trans Pattern Anal Mach Intell Date: 2008-02 Impact factor: 6.226

2. An Adaptive Fusion Algorithm for Depth Completion.

Authors: Long Chen; Qing Li
Journal: Sensors (Basel) Date: 2022-06-18 Impact factor: 3.847

2 in total