Sangyong Park1, Jaeseon Kim1, Yong Seok Heo1,2. 1. Department of Electrical and Computer Engineering, Ajou University, Suwon 16449, Korea. 2. Department of Artificial Intelligence, Ajou University, Suwon 16449, Korea.
Abstract
To achieve high performance, most deep convolutional neural networks (DCNNs) require a significant amount of training data with ground truth labels. However, creating ground-truth labels for semantic segmentation requires more time, human effort, and cost compared with other tasks such as classification and object detection, because the ground-truth label of every pixel in an image is required. Hence, it is practically demanding to train DCNNs using a limited amount of training data for semantic segmentation. Generally, training DCNNs using a limited amount of data is problematic as it easily results in a decrease in the accuracy of the networks because of overfitting to the training data. Here, we propose a new regularization method called pixel-wise adaptive label smoothing (PALS) via self-knowledge distillation to stably train semantic segmentation networks in a practical situation, in which only a limited amount of training data is available. To mitigate the problem caused by limited training data, our method fully utilizes the internal statistics of pixels within an input image. Consequently, the proposed method generates a pixel-wise aggregated probability distribution using a similarity matrix that encodes the affinities between all pairs of pixels. To further increase the accuracy, we add one-hot encoded distributions with ground-truth labels to these aggregated distributions, and obtain our final soft labels. We demonstrate the effectiveness of our method for the Cityscapes dataset and the Pascal VOC2012 dataset using limited amounts of training data, such as 10%, 30%, 50%, and 100%. Based on various quantitative and qualitative comparisons, our method demonstrates more accurate results compared with previous methods. Specifically, for the Cityscapes test set, our method achieved mIoU improvements of 0.076%, 1.848%, 1.137%, and 1.063% for 10%, 30%, 50%, and 100% training data, respectively, compared with the method of the cross-entropy loss using one-hot encoding with ground truth labels.
To achieve high performance, most deep convolutional neural networks (DCNNs) require a significant amount of training data with ground truth labels. However, creating ground-truth labels for semantic segmentation requires more time, human effort, and cost compared with other tasks such as classification and object detection, because the ground-truth label of every pixel in an image is required. Hence, it is practically demanding to train DCNNs using a limited amount of training data for semantic segmentation. Generally, training DCNNs using a limited amount of data is problematic as it easily results in a decrease in the accuracy of the networks because of overfitting to the training data. Here, we propose a new regularization method called pixel-wise adaptive label smoothing (PALS) via self-knowledge distillation to stably train semantic segmentation networks in a practical situation, in which only a limited amount of training data is available. To mitigate the problem caused by limited training data, our method fully utilizes the internal statistics of pixels within an input image. Consequently, the proposed method generates a pixel-wise aggregated probability distribution using a similarity matrix that encodes the affinities between all pairs of pixels. To further increase the accuracy, we add one-hot encoded distributions with ground-truth labels to these aggregated distributions, and obtain our final soft labels. We demonstrate the effectiveness of our method for the Cityscapes dataset and the Pascal VOC2012 dataset using limited amounts of training data, such as 10%, 30%, 50%, and 100%. Based on various quantitative and qualitative comparisons, our method demonstrates more accurate results compared with previous methods. Specifically, for the Cityscapes test set, our method achieved mIoU improvements of 0.076%, 1.848%, 1.137%, and 1.063% for 10%, 30%, 50%, and 100% training data, respectively, compared with the method of the cross-entropy loss using one-hot encoding with ground truth labels.
Entities:
Keywords:
limited training data; regularization; self-knowledge distillation; semantic segmentation
The goal of semantic segmentation is to predict the predefined class (or label) of each pixel, which is fundamental yet challenging in computer vision. Owing to its increasing importance, it is widely adopted in various applications using vision sensors, such as autonomous driving [1,2], 3D reconstruction [3], and medical image analysis [4,5]. In recent years, deep convolutional neural networks (DCNNs) have achieved significant performance improvements and have been the dominant solution for semantic segmentation. Since the introduction of FCNs [6], various architectures have been proposed, including U-Net [4], DeepLab [7,8,9,10], and PSPNet [11].To achieve high performance, supervised learning in addition to a significant amount of training data are typically used in DCNN-based methods. Creating ground-truth labels for semantic segmentation requires more time, human effort, and cost compared with other tasks such as classification and object detection, because the ground-truth label of every pixel is required. Hence, it is practically demanding to train DCNNs using a limited amount of training data for semantic segmentation.Generally, training DCNNs using a limited amount of data is problematic because it easily results in a decrease in the accuracy of the networks because of overfitting to the training data [12]. Overfitted models generate good results for the training dataset but subpar results for validation and test datasets, which are not used in training. However, many studies regarding semantic segmentation have focused mainly on improving the accuracy by assuming a significant amount of training data, whereas the problem of insufficient data for training has rarely been prioritized.To mitigate the overfitting problem, the regularization method is widely used. This method includes early stopping [13], -regularization [14], batch normalization [15], dropout [16], data augmentation [17,18,19] and regularizing the predictive distribution of DCNNs [20,21,22,23,24]. Specifically, regularizing the predictive distribution is an approach that regularizes the probabilities of network results. In this regard, various methods exist, such as label smoothing (LS) [20,21], confidence penalty (CP) [22,23], and knowledge distillation (KD) [25,26].LS [20,21] generates a smoothed probability vector by adding a one-hot encoding vector using the ground truth and a uniform vector. It enforces the feature from the penultimate layer to be closest to the template of the correct class, while maintaining the same distance as those of the incorrect classes [20]. Hence, the probability generated using LS does not include the correlation information between classes. CP [22] increases the entropy of the prediction probability distribution by subtracting the entropy of the probability from the loss function. It does not include correlations between classes. In addition, it is problematic to further increase the entropy when the entropy of the probability distribution is already large, because this can render the label decision of the pixel ambiguous. KD improves the performance of the student network using the knowledge of the teacher network. However, a good teacher network is typically required to train the student network. Although the methods described above demonstrate good performances, they do not consider the problem of limited training data, and most of them are designed for classification problems, not semantic segmentation.In this paper, we propose a new regularization method called a pixel-wise adaptive label smoothing (PALS) via self-knowledge distillation to stably train semantic segmentation networks in a practical situation, in which only a limited amount of training data are available. In this regard, we assume that the estimated probability distribution of each pixel exhibits certain relationships and correlations between all pairs of classes [27]. For example, the probabilities of bus and train classes exhibit higher correlations and closer relationships compared with those of bus and sky. Another intuition is that several pixels of the same class exist in an image. Hence, incorrect pixels can benefit from the correct pixels in an image by enforcing consistent distributions between pixels in the same class.Based on these assumptions, the proposed method generates a pixel-wise adaptive soft label to regularize the estimated probability distribution of each pixel by fully utilizing the internal statistics of the pixels within an input image. Figure 1 shows a schematic flowchart of our method. In this regard, we compute a similarity matrix that encodes the affinities between all pairs of pixels. Based on this matrix, an aggregated probability distribution is computed by adaptively combining the probability distributions of correctly estimated pixels at other positions in an image. Our method compensates for insufficient data using soft labels obtained by aggregating the probabilities of other pixels in an image. However, in the early training step, the correctly predicted pixels are insufficient. Hence, we adaptively add a uniform distribution to the aggregated distribution as a function of the number of training iterations. As such, in the early step, a uniform probability has more weight than an aggregated probability. As training progresses, the aggregated distribution yields a larger weight. Although the aggregated distributions facilitate the reduction in the variance error of the estimation, they can result in increase in the bias error [28]. To reduce both bias and variance errors, we added one-hot encoded distributions with ground-truth labels to these aggregated distributions, which yielded our final soft labels.
Figure 1
A schematic flowchart of our method. Our method aggregates distributions based on pair-wise feature similarity and generates a pixel-wise soft label by weighted sum of a one-hot encoding with ground truth label and the aggregated distribution for each pixel according to training iteration.
Figure 2 shows the results of our proposed method and the conventional cross entropy (CE) method [10] for various ratios of limited training data on the Cityscapes dataset [29]. We used the same network as DeepLab-V3+ [10], hyperparameters, and a limited training data to compare those methods. The CE method, which involves less training data, predicts well for load, sidewalk, car, and vegetation classes, but not for bus classes. This is because the pixels for the bus class are fewer in all the training data, and the number of bus class pixels is further reduced in the limited training data. Therefore, overfitting occurs easily in the CE method owing to the limited training data. By contrast, our proposed method yields more accurate results than the CE method.
Figure 2
Comparative results of methods trained using various ratios of limited training data. Results of various ratios of training data including 10%, 30%, 50%, and 100% are shown. Value below each result represents mIoU.
The contributions of our method are summarized as follows:We propose a new probability regularization method for limited training data using a self-knowledge distillation scheme;We propose a pixel-wise adaptive label smoothing (PALS) by fully utilizing the internal statistics of pixels within an input image;We demonstrate the effectiveness of our method by showing improved accuracy compared with previous methods for various ratios of training data, such as 10%, 30%, 50%, and 100% on the Cityscapes dataset [29] and the Pascal VOC2012 dataset [30].
2. Related Work
2.1. Semantic Segmentation
Semantic segmentation is a pixel-wise classification problem that aims to predict the categories of each pixel in a specified image. Various approaches have been proposed to improve the performance of semantic segmentation since the introduction of FCNs [6]. The encoder-decoder architecture [4,31,32] was proposed in early studies to recover spatial losses caused by pooling layers in the networks. Liu et al. [33] and Peng et al. [34] proposed enlarging the receptive field, which is crucial for obtaining context information. In addition to enlarging the receptive field and capturing multiscale context information, refs. [8,9,10,11] proposed pyramid feature pooling methods. To learn semantically richer and spatially more precise feature representations, [35,36,37,38,39,40] combined multiresolution feature maps. Based on the self-attention scheme [41,42], some researchers [43,44,45,46] proposed capturing relational context information by aggregating the relations between pixels.However, because these studies did not consider situations involving limited training data, which are typically encountered in real-world applications, several researchers have proposed weakly/semi-supervised learning-based methods to address this issue. Refs. [47,48,49,50] used image-level labels, refs. [51,52,53] used bounding boxes, and [54,55,56,57,58,59,60] proposed utilizing unlabeled images. Whereas additional data or annotations are required in the above-mentioned methods, Zhao et al. [61] proposed a pretraining to address the problem of limited data. Specifically, they trained a network twice by pretraining a model based on label-based contrastive learning [62] first, and then fine-tuning the model with cross-entropy loss. Unlike the method described in [61], the proposed method does not require any pretraining.
2.2. Regularization
Regularization is a set of techniques that aims to avoid overfitting and improve the generalization of a model. Typical methods to avoid overfitting the training data include /-regularization [14], dropout [16], batch normalization [15] and data augmentation [63]. Additionally, some researchers have proposed regularizing the output of a model using target modification approaches. LS [20,21] uses soft targets, which are the weighted average of one-hot targets and uniform distribution over labels. CP [22,23] regularizes the output of a model by penalizing low-entropy output distributions. These methods prevent the model from becoming overconfident [20,22]. Recently, researchers have extended this idea to other tasks, such as domain adaptation [64,65], incremental learning [66], and self-knowledge distillation (self-KD) [24,67,68]. By contrast, this study focuses on training semantic segmentation models using a limited amount of labeled data. Additionally, the proposed method modifies the target distribution by aggregating the probabilities of pixels based on their similarities to the output of the model.On the other hand, KD [25,69] exploits the predictions of the teacher model, which is relatively large, to transfer knowledge to the student model, which is relatively small. Recently, various approaches have been extended to semantic segmentation [70,71,72]. However, because the training process based on teacher-student knowledge distillation requires additional teacher networks, the computational costs are high. However, it has been demonstrated [24,67,68,73,74] that self-KD, which causes the model to learn knowledge from itself, is effective in exploiting a potential capacity of a single model. Although these works are simple and effective, they do not demonstrate the effectiveness of their works in the limited labeled data setting for semantic segmentation. Table 1 summarizes the strengths and weaknesses of the various regularization methods described above.
Table 1
Strengths and weaknesses of various regularization methods.
Method
Strength
Weakness
LS [20]
It has a positive effect on generalization using the weighted sum of one-hot encoding and the uniform distribution;
It can be applied when a teacher model is not available.
The weighting factor for the uniform distribution is fixed and not learnable;
The uniform distribution is not learnable and is not optimal for each pixel.
CP [22]
It has a positive effect on generalization using the entropy term;
It can be applied when a teacher model is not available.
The weighting factor for the entropy term is fixed and not learnable;
It may increase the ambiguity of the estimated distribution when the entropy of the distribution is already large.
KD [25]
It has a positive effect on generalization by use of the prediction of the teacher network.
It cannot be applied when a teacher model is not available.
Ours
It has a positive effect on generalization using the weighted sum of one-hot encoding and the pixel-wise aggregated distribution;
It can be applied when a teacher model is not available.
The weighting factor for the pixel-wise aggregated distribution is fixed and not learnable.
3. Revisit of CE, LS, CP, KD
In this section, we briefly describe previous distribution regularizing methods, including the CE loss function, LS [20,21], CP [22], and KD [25].
3.1. Cross Entropy
Since the introduction of the FCNs [6], most semantic segmentation networks have been designed using convolutional layers without fully connected layers. The features of the last convolutional layer in a model are known as logits , where C is the number of classes, and and are the height and width of the logits, respectively. The predicted distribution map is then generated from Z, where H is the height, and W is the width of the original input image. It is noteworthy that when Z is different from in terms of the spatial size, Z is typically resized to the same resolution as . In the typical setting, is defined using the softmax operation, as follows:
where denotes the probability of the cth channel of the ith pixel of . Subsequently, the CE loss is defined as
where is a one-hot encoded distribution map using ground-truth labels, and is the value at the cth channel of the ith pixel of Y. is the CE of the ith pixel. Typically, is defined as the average CE value of for all pixels.
3.2. Label Smoothing
LS [20] adds a one-hot encoded ground truth and uniform distribution to generate a smooth probability distribution. Subsequently, the smoothed probability distribution map is defined as follows:
where is the probability distribution vector of the ith pixel of , is a uniform distribution vector, where each element is , and is the weighting factor for a uniform distribution vector. Subsequently, the label smoothing loss is defined as
3.3. Confidence Penalty
CP [22] induces an increase in the entropy of the predicted distribution . The CP [22] loss is defined as follows:
where is a weighting factor, and represents the entropy of .
3.4. Knowledge Distillation
KD transfers the knowledge of a well-trained teacher network to the student network to improve its performance on the student network. Typically, the KD loss function [25] is defined as
where denotes the predicted distribution of the teacher network at the ith pixel. is the Kullback–Leibler (KL) divergence between the two distributions, and is a weighting value of the KL divergence term.
4. Proposed Method
In this section, we introduce our pixel-wise adaptive label smoothing (PALS) via self-knowledge distillation for semantic segmentation. We assume that only a small amount of training data are available to train the network. For each input image, various pixels share the same class, because one object comprises several pixels, and multiple objects may exist in an input image. Hence, our method generates a pixel-wise adaptive soft label for each pixel by aggregating the probability distributions of correctly estimated pixels of the same class. Soft labels function as a teacher in regularizing the distributions of each pixel.Figure 3 shows an overview of our proposed method, which is categorized into training and test paths. Let an input image be , where H is the height, W is the width, and the number of color channels is three. In training path, to improve the network performance, we generate an adaptive soft label map using the proposed PALS module, where C is the number of classes. The structure of the PALS module is explained in detail in the following subsection. By comparing P and the estimated probability distribution , we compute a loss for training the network. In the test path, we predict our result using only the probability distribution .
Figure 3
Overview of the proposed method, which is categorized into training and test paths. Blue and red arrows represent training and test paths, respectively.
4.1. PALS Module
Figure 4 illustrates our PALS module. The input features of the module are logits and penultimate feature map , where K is the number of channels of the penultimate feature map, and and are the spatial sizes. To compute a similarity matrix that contains similarities or correlations between all pairs of features in E, we perform matrix multiplication using reshaped matrices and from E. Therefore, S is defined as
where is a column vector that includes all correlations between a feature of the ith spatial position and all features in E. To perform normalization for each column vector, we performed a softmax operation along each column axis. Subsequently, is defined as
where represents the softmax operation.
Figure 4
Process of our PALS module.
To compensate for insufficient training data, we fully utilize the internal statistics of the pixels in the input image. In this regard, we compute a pixel-wise ensemble of distributions by adaptively aggregating the distributions of other pixels based on the pixel affinity. Thus, we have generated an aggregated distribution map from a proposed probability aggregation (PA) module, which exploits the information of correctly estimated pixels of the same class in an input image. Figure 5 shows the process of the PA module in detail. To compute , we generate a set of class masks and a correct mask B. To generate a class mask that corresponds to the cth class, we create a binary mask for the cth class using a downsampled ground-truth image. An element of in each spatial position has a value of 1 when the ground-truth label corresponds to class c, and 0 otherwise. Furthermore, we reshape to generate a one-dimensional vector . Subsequently, we concatenate the vector C times along the column axis to generate . However, to create a correct mask , we generate a binary map , where each element of V is 1 when the predicted label using Z is correct, and 0 otherwise. We reshape V to generate . Subsequently, the correct mask B is obtained by concatenating the vector C times along column axis. Subsequently, and are defined as
where ⊙ is an element-wise multiplication operation, and ⊗ is a matrix multiplication operation. is an upsampling operation that uses a bilinear interpolation. is a probability distribution map obtained by performing the softmax operation along the channel axis for each pixel from Z and then reshaping it. Q is the upsampled result of .
Figure 5
Process of PA module, where denotes the downsampling operation.
It is noteworthy that the aggregated distribution Q at the early iteration is not sufficiently accurate, because only a few pixels are correct in the early iteration. Hence, we adaptively combined Q and the uniform distribution U as a function of the current iteration number . Subsequently, the fused probability distribution map at iteration is defined as
where is the distribution vector at the ith pixel in . U is a uniform distribution vector where each element is . is the total iteration number, and is the current iteration number. Here, represents the ratio of the current iteration to the total iterations , similar to [75].Generally, aggregated distribution and one-hot encoded distributions exhibit different properties. The former reduces the variance error, whereas the latter reduces the bias error [28]. Therefore, we combined and a one-hot encoded distribution map to reap the advantages of both and then generated the final soft label , called a pixel-wise adaptive label smoothing (PALS). Here, , the probability distribution vector of the ith pixel in at iteration is defined as
where is a one-hot vector at the ith pixel in Y, and is the weighting factor between two vectors and . It is noteworthy that, at the initial iteration, where is 0, is the same as in Equation (3). As iteration progresses, increases up to 1, and the uniform distribution U in Equation (3) is replaced with the pixel-wise aggregated probability Q in Equation (9).
4.2. Loss Function
The loss function for training the network is defined as
where and are the proposed soft target defined in Equation (11) and the predicted distribution of the ith pixel at iteration , respectively. We computed our loss function using the KL divergence between the two distributions.
5. Experiments
In this section, we compare our proposed method with previous methods and analyze the effectiveness of our proposed method based on various experimental settings. Further details are provided in the following subsections.
5.1. Dataset
To perform evaluations, we used the Cityscapes [29] dataset and the Pascal VOC2012 [30] dataset for semantic segmentation. The Cityscapes dataset includes urban scenes for semantic segmentation, and it contains 30 classes; however, we used only 19 classes for training and testing, similar to previous studies [9,10,11]. Each image exhibited a high resolution of . The dataset contains 5000 pixel-level finely annotated images and 20,000 coarsely annotated images. In the finely annotated images, 2975/500/1525 images are allocated for training, validation, and testing, respectively. We used only finely annotated images for training. The Pascal VOC2012 dataset [30] is one of the most competitive semantic segmentation datasets. It contains 21 classes, including 20 foreground classes and 1 background class. This dataset consists of 10,582 training, 1449 validation, and 1456 test images.
5.2. Implementation Details
Our method was applied to the DeepLab-V3+ [10] model, with Xception65 [76] and ResNet18 [77] as backbone networks. The former is a deeper and heavier network than the latter. We initialized the backbone networks using weights pretrained on the ImageNet [78] dataset, whereas the weights of other modules, such as the ASPP module [10], were randomly initialized. To train the networks, we set the initial learning rate to 0.02, and we used the polynomial learning rate scheduler with factor using SGD optimization. For unbiased comparisons, we used the same hyperparameters, including a batch size of 8, and 200 epochs for all the experiments. For the Cityscapes dataset, to evaluate the accuracy of the networks for a limited amount of training data, we randomly selected 10%, 30%, 50%, and 100% of the images from the original training dataset, where each proportion comprises 297, 894, 1487, and 2975 training images, respectively. For data augmentation, we performed random horizontal flipping and random-scale cropping. The random scale range was (0.5, 2.0), and the cropping size was 384 × 384. During training, half-size images were used to reduce memory consumption, and full-size images were used on the validation and test data after the results were upsampled. To identify a suitable weighting factor in Equation (11), we performed several experiments by changing to , and empirically discovered that yielded the best results; hence, we used it for all the experiments.For the Pascal VOC2012 dataset [30], we set mostly the same parameters as those of the Cityscapes dataset except the cropping size and the number of training epochs. Our method was applied to the DeepLab-V3+ [10] with Xception65 [76] as the backbone network for the Pascal VOC2012 dataset. The cropping size was set at 480 × 480, and the number of training epochs was 100. We randomly selected 10%, 30%, 50%, and 100% of the images from the original training set, where each proportion comprises 1059, 3175, 5291, and 10,582 training images, respectively.
5.3. Comparison with Previous Methods
We compared our proposed method with previous methods, including CE [10], LS [20], and CP [22]. For LS [20] and CP [22], we empirically determined the values of in Equation (3) and in Equation (5) that the best performance was achieved using and . For unbiased comparisons, we used the same limited training data for all the comparative methods.
5.3.1. The Cityscapes Dataset
Table 2 lists the mIoU results of the Cityscapes training, validation, and test data for DeepLab-V3+ with the Xception65 network.
Table 2
mIoU values of different methods for DeepLab-V3+ with the Xception65 network. Bold expressions indicate the best accuracy.
Method
Data
10%
30%
50%
100%
CE (baseline) [10]
train
79.054 ± 0.307
81.650 ± 0.511
82.291 ± 0.024
82.586 ± 0.157
val
59.886 ± 0.430
67.756 ± 0.312
69.895 ± 0.212
73.167 ± 0.155
test
59.348 ± 0.046
66.224 ± 1.270
69.522 ± 0.003
72.272 ± 0.237
LS [20]
train
78.117 ± 0.040
82.032 ± 0.001
83.505 ± 0.008
83.219 ± 0.117
val
59.459 ± 0.051
68.822 ± 0.141
70.190 ± 0.499
73.748 ± 0.137
test
59.331 ± 0.015
67.717 ± 0.111
69.606 ± 0.081
72.542 ± 0.082
CP [22]
train
76.269 ± 13.697
79.411 ± 6.892
80.755 ± 4.305
82.303 ± 2.204
val
58.137 ± 3.820
67.715 ± 4.373
70.517 ± 0.393
73.830 ± 0.151
test
57.339 ± 3.997
65.397 ± 1.011
68.650 ± 0.483
72.814 ± 0.643
Ours
train
78.641 ± 0.187
81.784 ± 0.799
82.711 ± 0.002
83.342 ± 0.023
val
59.767 ± 0.209
69.285 ± 0.618
70.974 ± 0.240
73.889 ± 0.288
test
59.424 ± 0.122
68.072 ± 0.077
70.659 ± 0.467
73.335 ± 0.102
Each column represents the data ratio used in the training data, and each row represents a different method. Each method was trained three times, and the average values of the mIoU and corresponding variances are listed in Table 2. It is noteworthy that all methods suffer from overfitting when the amount of training data were sufficiently small. The accuracy for the training data was favorable, whereas that of the validation and test data decreased significantly. The results for the validation data show that our method yielded the best accuracy, except for the results based on only 10% of the data. Meanwhile, based on the results of the test data, our method show the best accuracy for all data ratios as compared with the other methods. LS generates soft labels by adding a uniform distribution to a one-hot vector, which results in a better accuracy than the baseline CE method. However, LS is suboptimal for the regularization function because it does not consider the correlation between classes. CP regularizes the distribution by subtracting its entropy. CP performs worse as the amount of training data decreases because the entropy of the distribution is already large, particularly when the training data are limited.Figure 6 shows the qualitative comparison results of different methods for DeepLab-V3+ [10] with the Xception65 [76] network on the validation data. The ratio numbers in the first column in Figure 6 denote the data ratios used for training from the original training set. It is observed that our method generates results with more accurate boundary regions and less noise for homogeneous regions in the train, truck, and bus objects compared with other methods. For the 10% training data, the results of most methods include severe errors and ambiguous boundaries for the pole and bus classes, which contain fewer labeled pixels than the other classes. By contrast, our method yields more accurate and clearer boundaries for those classes. Similarly, for the 30% training data, our method yields less noise for the object boundaries of cars and buildings as compared with the other methods. For the 50% training data, our method predicts the boundaries of trucks more clearly as compared with other methods. For the 100% training data, our method yields predictions that are better than those of LS [20] and CP [22] for the bus objects, and better than that of CE [10] for the buildings.
Figure 6
Results of the comparison of various methods using limited training data for DeepLab-V3+ [10] with the Xception65 [76] network on the Cityscapes dataset. (a) Input image. (b) Ground-truth image. (c) CE [10] result. (d) CP [22] result. (e) LS [20] result. (f) Our result.
Table 3 shows the mIoU results of various methods for DeepLab-V3+ [10] with the ResNet18 backbone [77], which is a lighter networks than DeepLab-V3+ [10] with the Xception65 backbone [76]. Table 3 shows that our method achieves the best accuracy, except for the results based on only 100% validation data and 50% test data. As shown in Table 3, LS performs better than CE for most cases, whereas CP [22] is less accurate than CE [10] for the 10% and 50% validation and test data, respectively. This is because a light network typically exhibits lower confidence in term of probability distribution compared with a heavy network [79]. Therefore, CP [22] resulted in reduced accuracy because it enlarged the entropy of the probability distributions.
Table 3
mIoU values of different methods for DeepLab-V3+ with the ResNet18 network. Bold expressions indicate the best accuracy.
Method
Data
10%
30%
50%
100%
CE (baseline) [10]
train
68.878 ± 1.464
74.616 ± 2.380
76.337 ± 5.025
78.619 ± 0.061
val
51.215 ± 2.327
61.104 ± 2.133
63.656 ± 2.274
67.754 ± 3.187
test
51.091 ± 0.873
59.862 ± 1.684
63.506 ± 3.966
68.795 ± 0.213
LS [20]
train
70.774 ± 1.395
78.384 ± 0.195
78.766 ± 0.020
79.837 ± 0.025
val
53.650 ± 0.736
64.088 ± 0.038
65.463 ± 0.050
70.424 ± 0.087
test
54.182 ± 0.508
62.089 ± 0.048
65.507 ± 0.059
69.752 ± 0.139
CP [22]
train
66.839 ± 34.734
74.860 ± 1.199
75.303 ± 1.723
78.585 ± 0.263
val
49.267 ± 12.073
61.485 ± 1.302
63.292 ± 1.325
69.134 ± 0.120
test
49.943 ± 10.104
60.262 ± 1.324
62.889 ± 2.235
69.752 ± 0.139
Ours
train
71.452 ± 1.704
78.227 ± 0.113
78.838 ± 0.131
79.849 ± 0.003
val
54.219 ± 0.065
64.172 ± 0.157
65.672 ± 0.005
70.374 ± 0.008
test
54.683 ± 0.048
62.649 ± 0.022
65.441 ± 0.175
69.837 ± 0.118
Figure 7 shows the qualitative comparison results of different methods using DeepLab-V3+ [10] with the ResNet18 [77] network on the validation data. Because a light network was used, these results indicate less accurate performance than the heavy network. However, our results show clearer boundaries and less noise compared with the other methods. For the 10% training data, our method yielded better predictions than the other methods for rider objects. For the 30% and 50 % training data, our method yielded more accurate results, particularly for truck objects, compared with the other methods. For the 100% training data, our method yielded better predictions for train objects compared with the other methods.
Figure 7
Results of the comparison of various methods using limited training data for DeepLab-V3+ [10] with the ResNet18 [77] network on the Cityscapes dataset. (a) Input image. (b) Ground-truth image. (c) CE [10] result. (d) CP [22] result. (e) LS [20] result. (f) Our result.
5.3.2. Pascal VOC2012 Dataset
Table 4 shows the mIoU results of various methods for DeepLab-V3+ [10] with the Xception65 network [76] on the Pascal VOC2012 dataset. It is observed that our method achieves the best accuracy for all the cases in validation and test data. Specifically, our method achieved mIoU improvements of 1.447%, 0.713%, 3.185%, and 1.153% for 10%, 30%, 50%, and 100% training data compared with the baseline method, respectively. LS performs better than CE for all cases, whereas CP [22] is less accurate than CE [10], except when using 100% training data.
Table 4
mIoU values of different methods for DeepLab-V3+ with Xception65 network on the Pascal VOC2012 dataset [30]. Bold expressions indicate the best accuracy.
Method
Data
10%
30%
50%
100%
CE (baseline) [10]
val
57.338
67.049
70.079
74.708
test
56.538
68.240
69.745
73.615
LS [20]
val
56.981
69.772
73.733
76.320
test
57.111
68.666
72.535
74.650
CP [22]
val
52.951
68.424
69.603
74.373
test
53.982
67.368
69.452
73.817
Ours
val
58.989
70.939
73.814
76.407
test
57.985
68.953
72.930
74.768
Figure 8 shows the qualitative comparison results of different methods using DeepLab-V3+ [10] with the Xception65 [76] network on the validation data. Since our method aggregates distributions using correctly estimated pixels based on the pair-wise feature similarity, the objects in our results have more accurate boundaries and less noise compared with other methods that independently estimate each pixel. For the 10% training data, for example, the predicted result for the bird class of our method is more accurate than others. For the 30% and 50% training data, most other methods incorrectly predict the dog class and the table class, respectively. In contrast, our method correctly estimates them. For the 100% training data, our method yields better predictions for a complex table class consisting of a multitude of small objects compared with other methods.
Figure 8
Results of the comparison of various methods using limited training data for DeepLab-V3+ [10] with the Xception65 [76] network on the Pascal VOC2012 dataset. (a) Input image. (b) Ground-truth image. (c) CE [10] result. (d) CP [22] result. (e) LS [20] result. (f) Our result.
6. Ablation Study
As introduced in Section 4, our proposed method generates a pixel-wise adaptive soft label P in Equation (11) and uses it to define a loss function. When generating P, we used multiple components, including the same-class masks A, the correct mask B, a uniform distribution U, and an adaptive weight . To investigate the effectiveness of our proposed method, we conducted experiments where each component was removed from our original method. Table 5 shows the mIoU values obtained when each component was removed from our original method. The first row shows the mIoU results of our original method, defined by Equation (11). Here, “without class mask A” represents a method where the mask A is removed in Equation (9) by setting all elements in A to 1. Furthermore, “without correct mask B” represents a method where the mask B is removed in Equation (9) by setting all the elements in B to 1, and “without uniform distribution U” represents a method where the uniform distribution U is removed from Equation (10) by setting the value of to 1 for all the iterations. Lastly, “without adaptive weight ” represents a method that fixes the weight to 0.5 in Equation (10) for all the iterations instead of using it in an adaptive manner.
Table 5
Results of DeepLab-V3+ [10] with the Xception65 [76] on the Cityscapes validation data. “w/o X” represents a method where component of “X” was removed from our original method. Bold expressions indicate the best accuracy.
Method
10%
30%
50%
100%
Our original method
60.279
69.912
71.535
73.849
w/o class mask A
59.181
68.756
71.317
73.490
w/o correct mask B
59.211
69.620
70.706
73.369
w/o uniform distribution U
59.785
69.107
71.279
72.563
w/o adaptive weight ε
59.902
68.681
71.481
72.996
By comparing the results presented in the first and second rows in Table 5, the effectiveness of using A was observed. When pixels of different classes participated in the computation of Q, the probability distributions of the pixels were mixed with those of the other classes, which resulted in less accurate results. By comparing the results of the first and third rows in Table 5, the effectiveness of using B was observed. When the incorrectly estimated pixels participated in the computation of Q, the erroneous probability distributions of the pixels contaminated the final soft targets, which resulted in less accurate results. By comparing the first and fourth row results in Table 5, the effectiveness of using the uniform distribution U was observed. In the early iteration of the training step, the aggregated distribution Q was inaccurate because many pixels were incorrect pixels. Hence, a uniform distribution U is more beneficial than Q in the early iterations. The effectiveness of the adaptive weight was investigated by comparing the first and last rows in Table 5. If we fix the value of to , then the weights of Q and U will be the same for all iterations. Because Q contains reliable information in the later iterations, it cannot fully function adaptively when is fixed.On the other hand, to investigate the effect of varying in Equation (11), we performed several experiments by changing values to {0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.4, 0.5}. Table 6 shows the mIoU values of the proposed method on the Cityscapes validation data using DeepLab-V3+ [10] with the Xception65 network [76] as a function of the values. It is observed that our method generates similar performances for various values for most cases. When is 0.2, our method generates the best performance, except for the 50% dataset case. Based on these results, we fixed at 0.2 for all the comparative evaluations.
Table 6
mIoU values of our method by varying values on the Cityscapes validation data. The value of represents the weighting factor in Equation (11). Bold expressions indicate the best accuracy.
α
10%
30%
50%
100%
0.05
59.552
68.080
69.352
73.153
0.10
60.074
68.399
70.269
72.686
0.15
60.201
66.600
68.719
73.093
0.20
60.279
69.912
71.535
73.849
0.25
59.758
69.683
71.830
73.540
0.30
59.807
69.481
69.780
73.450
0.40
58.150
68.095
71.040
72.629
0.50
58.062
69.460
70.531
73.746
7. Conclusions
We have proposed a pixel-wise adaptive label smoothing (PALS) method via self-knowledge distillation to train semantic segmentation networks for limited training data. In this regard, we aggregated the distribution of each pixel to fully utilize redundant information in an image by computing a similarity matrix that encodes the correlations between pairs of pixels. Based on the similarity matrix, we proposed a soft label by progressively adding a one-hot encoded label and the aggregated distribution for each pixel as a function of iteration. Our method yielded the most accurate results for various ratios of limited training data on the Cityscapes dataset and the Pascal VOC2012 dataset compared with previous regularization methods using DeepLab-V3+ with the Xception65 and ResNet18 networks.
Authors: Liang-Chieh Chen; George Papandreou; Iasonas Kokkinos; Kevin Murphy; Alan L Yuille Journal: IEEE Trans Pattern Anal Mach Intell Date: 2017-04-27 Impact factor: 6.226