Hybrid Dilated Convolution with Multi-Scale Residual Fusion Network for Hyperspectral Image Classification.

Chenming Li, Zelin Qiu, Xueying Cao, Zhonghao Chen, Hongmin Gao, Zaijun Hua.

Abstract

Convolutional neural networks (CNNs) have been shown to outperform traditional methods in hyperspectral image (HSI) classification. However, conventional CNNs for HSI classification tend to focus on spectral features and neglect spatial information. In this paper, a new HSI model called the local and hybrid dilated convolution fusion network (LDFN) is proposed, which fuses local detail information with rich spatial features by expanding the receptive field. The method works as follows. First, several standard operations are selected, such as standard convolution, average pooling, dropout and batch normalization. Then, local and hybrid dilated convolutions are fused to extract rich spatial-spectral information. Finally, the different convolution layers are gathered into a residual fusion network, whose output is fed into a softmax layer for classification. Experiments on three widely used hyperspectral datasets (Salinas, Pavia University and Indian Pines) show that LDFN outperforms state-of-the-art classifiers.

Keywords:  HSI classification; local and hybrid dilated convolution; residual fusion networks

Year:  2021        PMID: 34068823      PMCID: PMC8151123          DOI: 10.3390/mi12050545

Source DB:  PubMed          Journal:  Micromachines (Basel)        ISSN: 2072-666X            Impact factor:   2.891


1. Introduction

Hyperspectral remote sensing uses high-altitude detection equipment operating in the visible, infrared and microwave ranges, together with information processing and transmission, to classify and recognize ground objects remotely and without contact. A hyperspectral image (HSI) has hundreds of adjacent narrow bands [1] and therefore a large number of channel dimensions, so it plays a significant role in remote sensing. An HSI carries two kinds of important information: spectral information, which makes it possible to differentiate land-cover materials, and spatial information, which describes the spatial structure of the scene. HSI is therefore widely applied in many domains, such as military exploration [2,3], agriculture [4,5], environmental monitoring [6,7] and medical treatment [8].
In the early days of HSI classification, traditional machine learning methods were widely used, for example support vector machines (SVM) [9,10], k-nearest neighbor (KNN) [11,12], multinomial logistic regression (MLR) [13,14] and decision trees [15,16]. However, the same material can show different spectra at different locations, and different materials can have similar spectral characteristics, so the resulting classification maps remain noisy because these methods have limited ability to extract spatial structure features. Because it is difficult to classify hyperspectral images effectively from spectral features alone, many methods based on hand-crafted spatial and spectral features have been proposed, for example Markov random fields (MRFs) [17] and the generalized composite kernel machine [18]. In recent years, deep learning methods have made it possible to extract features automatically.
The basic idea of deep learning is that the model itself learns during training which features matter most, with few human-imposed constraints. Deep learning methods have therefore been widely used in HSI classification. For instance, M. He et al. [19] proposed a multi-scale 3D deep convolutional neural network that learns 2D multi-scale spatial features and 1D spectral features from HSI data end-to-end. S. Mei et al. [20] proposed an unsupervised 3D convolutional auto-encoder (3D-CAE) with a 3D decoder that reconstructs the input patterns, so that all parameters can be trained without labeled training samples. In [21], a spectral-spatial residual network (SSRN) was put forward, which takes the original 3D cube as input and continuously learns discriminative HSI features through spectral-spatial residual blocks. In [22], a contextual deep CNN (D-CNN) optimizes local context interaction by exploiting the local spatial-spectral relationships of neighboring pixel vectors. In [23], a novel synergistic CNN that fuses 2D and 3D networks was proposed for accurate HSI classification. In [24], a 3D CNN based on a residual group channel and attention network was proposed, which strengthens spatial features by extracting spatial context information and reduces the loss of useful information. These models build on deep learning in different ways. However, as the networks become deeper, they become harder to train and their accuracy declines. To address these problems, this paper proposes a multi-scale feature fusion network based on local and hybrid dilated convolution (LDFN), whose fusion strategy both captures local detail information and collects rich spatial features by expanding the receptive field.
A residual fusion network is designed to integrate the local convolution and the hybrid dilated convolution (HDC); it gives the network a deeper structure with shortcut connections to other layers. As a result, the method is robust and learns spatial-spectral information well for classification. In summary, the main contributions of this paper are three-fold. First, the proposed hybrid dilated convolution, stacked with different dilation rates, is used to extract spatial information. Second, local and hybrid dilated convolutions are integrated in a way that can simply replace the traditional standard convolution: local convolution connects neighboring pixels closely, which makes the convolution layer more flexible and expressive, while hybrid dilated convolution enlarges the receptive field without increasing the amount of computation, so that the spatial-spectral features of hyperspectral images can be fully collected. Third, the proposed model uses a residual network [25] to fuse the preceding HDC and standard convolution on the main channels, which extracts multi-scale fusion features. The rest of this paper is arranged as follows. Section 2 discusses the related CNN methods and introduces the LDFN framework for HSI. Section 3 presents experiments on three benchmark hyperspectral datasets. Finally, conclusions are presented in Section 4.

2. Materials and Methods

2.1. Proposed Methods

A traditional CNN consists of several standard operations, such as convolution, activation, batch normalization and pooling. The details of the convolution operations are as follows.
Convolutional Layers
The convolution layer is the most significant part of a convolutional neural network. The input of each node is only a small part of the previous layer. Convolution layers analyze the feature maps of the previous layer through their filters and obtain more abstract features, so stacking them deepens the network. Let $X$ be the input data of size $H \times W \times C$, where $H$ and $W$ are the height and width of the spatial feature and $C$ is the number of spectral channels. Let $w$ and $b$ be the weight and bias parameters, and let $X_i^k$ denote the output of the $i$th layer for kernel $k$. The formula of the convolutional layer [26] is:
$$X_i^k = f\left(w_i^k * X_{i-1} + b_i^k\right),$$
where $*$ denotes convolution and $f(\cdot)$ represents the activation function.
Dilated Convolution and HDC
A classification algorithm should also take spatial-spectral characteristics into account. To extract hyperspectral features, standard convolution relies on many repeated operations, which greatly increases the computational complexity, while local convolution ignores the spatial similarity of adjacent regions. Figure 1 compares standard and dilated convolution. Dilated convolution enlarges the receptive field of the convolution kernel and captures multi-scale context information, which effectively addresses the insufficient extraction of spatial information. However, traditional dilated convolution can cause two problems. One is the gridding effect: the kernel samples the input discontinuously. The other is that long-range information may not be relevant, so the method may fail on small objects. Therefore, a hybrid dilated convolution, which combines different dilation rates, is proposed.
Figure 1

Standard and dilated convolution.
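
The growth of the receptive field under dilation can be illustrated with a few lines of arithmetic. The sketch below is not taken from the paper: it assumes 3 × 3 kernels and stride 1 (neither is stated in the text), and the function names are hypothetical. It uses the standard relations that a kernel of size k with dilation rate r spans k + (k - 1)(r - 1) positions along one axis, and that each stacked stride-1 layer adds (k - 1)r positions to the receptive field.

```python
from typing import Sequence


def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Extent along one axis covered by a single dilated kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def stacked_receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Receptive field along one axis of stride-1 convolutions applied in sequence."""
    field = 1
    for rate in dilations:
        # Each layer widens the field by (kernel_size - 1) * rate positions.
        field += (kernel_size - 1) * rate
    return field


# Assumed 3 x 3 kernels (kernel_size=3); the paper does not state the kernel size.
print(effective_kernel_size(3, 1))            # 3  (standard convolution)
print(effective_kernel_size(3, 2))            # 5  (dilation rate 2)
print(stacked_receptive_field(3, [1, 1, 1]))  # 7  (three standard layers)
print(stacked_receptive_field(3, [2, 3, 5]))  # 21 (rates 2, 3 and 5)
```

Under these assumptions, three dilated layers with rates 2, 3 and 5 cover a 21-pixel extent with the same number of weights that three standard layers use to cover 7 pixels.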

Hybrid dilated convolution consists of layers with different dilation rates and can effectively solve the problems above. Figure 2 illustrates the step from dilated convolution to HDC. HDC has three characteristics. First, the dilation rates follow a zigzag (sawtooth) structure. Second, the dilation rates of the stacked layers must not share a common divisor greater than 1. Last, the rates satisfy the following formula [27]:
$$M_i = \max\left[M_{i+1} - 2r_i,\; M_{i+1} - 2\left(M_{i+1} - r_i\right),\; r_i\right],$$
where $r_i$ represents the dilation rate of the $i$th layer and $M_i$ represents the largest dilation rate of the $i$th layer. With the integration of local convolution and HDC, full spatial information can be covered.
Figure 2

Hybrid dilated convolution.
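
The HDC conditions can be checked numerically. The sketch below follows the recurrence as it is usually stated in the HDC literature [27]: M_i = max[M_{i+1} - 2 r_i, M_{i+1} - 2(M_{i+1} - r_i), r_i], with M_n = r_n for the last layer and the design goal M_2 ≤ K for kernel size K. The kernel size K = 3 is an assumption (it is not stated in the text), the function names are hypothetical, and the zigzag property is not checked.

```python
from functools import reduce
from math import gcd
from typing import Sequence


def max_gap_second_layer(rates: Sequence[int]) -> int:
    """Return M_2 for dilation rates r_1..r_n, starting from M_n = r_n."""
    if len(rates) < 2:
        raise ValueError("need at least two stacked layers")
    gap = rates[-1]  # M_n = r_n
    # Apply the recurrence for layers n-1 down to 2 (0-based indices n-2 .. 1).
    for rate in reversed(rates[1:-1]):
        gap = max(gap - 2 * rate, gap - 2 * (gap - rate), rate)
    return gap


def satisfies_hdc(rates: Sequence[int], kernel_size: int = 3) -> bool:
    """Check the common-divisor rule and the M_2 <= K goal (K assumed to be 3)."""
    no_common_divisor = reduce(gcd, rates) == 1
    return no_common_divisor and max_gap_second_layer(rates) <= kernel_size


print(max_gap_second_layer([2, 3, 5]))  # 3
print(satisfies_hdc([2, 3, 5]))         # True
print(satisfies_hdc([2, 4]))            # False (common divisor 2)
```

With an assumed 3 × 3 kernel, the rates 2, 3 and 5 give M_2 = 3, which meets the goal M_2 ≤ K, whereas the pair 2 and 4 fails the common-divisor rule.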

2.2. HSI Classification Based on LDFN

HSIs have four characteristics: band correlation, high resolution, mass of data and spectral variability. To address the challenges these pose, the proposed deep CNN combines local convolution and hybrid dilated convolution, which extracts not only rich spectral content but also rich spatial information. The framework of the proposed LDFN model is shown in Figure 3.
Figure 3

The flowchart of the LDFN model.

As shown in Figure 3, the HSI data are first preprocessed. In preprocessing, some uninformative bands are removed first, and principal component analysis (PCA) [28] then reduces the spectral dimension so as to keep the most effective components of the hyperspectral information. Patch blocks centered on labeled pixels are then extracted to train LDFN. The overall process of LDFN is as follows. The input image block has size $H \times W \times C$, where $H$ and $W$ represent the height and width of the block in the spatial dimension and $C$ represents the number of bands in the spectral dimension. First, the image block is fed into a two-dimensional convolutional layer. After that, the main channel splits into two branches. In the upper branch, the image block passes through two local convolutional layers. In the lower branch, it passes through a hybrid dilated convolution (HDC) block, which is a stack of dilated convolution layers with dilation rates of 2, 3 and 5. The features of the two branches are then integrated into a composite layer, which is fed into a residual block. In the residual block, two convolution layers extract features from the input and generate an output feature layer, and a cross-layer connection concatenates the HDC layer, the composite layer and the output feature layer. The fused feature map then passes through a 1 × 1 two-dimensional convolutional layer, a 2 × 2 average pooling layer and a global average pooling layer. Finally, the high-level features are fed into the softmax layer to predict the classification label.
Every convolution uses 48 filters, except the first convolution, which uses 16. Batch normalization and ReLU activation follow each convolution operation except the first local convolution. Dropout is applied in the two local convolutions, with a rate of 0.2 in the first and 0.5 in the second.
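
The extraction of patch blocks centered on labeled pixels can be sketched in plain Python. This is only an illustration under stated assumptions: the paper does not describe how patches at the image border are handled, so zero padding is assumed here, and the name `extract_patch` and the toy image are hypothetical.

```python
from typing import List

Pixel = List[float]        # one spectral vector (C values after PCA)
Image = List[List[Pixel]]  # H x W grid of spectral vectors


def extract_patch(image: Image, row: int, col: int, size: int) -> Image:
    """Return the size x size neighbourhood centred on (row, col).

    Positions outside the image are filled with zero vectors (an assumption;
    the paper does not state its border handling).
    """
    if size % 2 == 0:
        raise ValueError("size must be odd so that the patch has a centre pixel")
    height, width = len(image), len(image[0])
    channels = len(image[0][0])
    half = size // 2
    patch = []
    for r in range(row - half, row + half + 1):
        patch_row = []
        for c in range(col - half, col + half + 1):
            inside = 0 <= r < height and 0 <= c < width
            patch_row.append(list(image[r][c]) if inside else [0.0] * channels)
        patch.append(patch_row)
    return patch


# Toy 4 x 4 image with 2 channels; each pixel stores (row, col) for easy checking.
toy = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
patch = extract_patch(toy, row=0, col=1, size=3)
print(len(patch), len(patch[0]), len(patch[0][0]))  # 3 3 2
print(patch[1][1])  # [0.0, 1.0]  (the centre pixel)
print(patch[0][0])  # [0.0, 0.0]  (zero padding above the image)
```

Each extracted patch inherits the label of its centre pixel, so one training sample is produced per labeled pixel.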

3. Results

3.1. Datasets and Baseline

Three benchmark hyperspectral datasets are used to verify the effectiveness of the proposed LDFN model: Indian Pines, Salinas and the University of Pavia. Figures 4, 5 and 6 show a sample band image, the ground truth and the color code of the Indian Pines, Salinas and University of Pavia datasets, respectively.
Figure 4

Indian Pines image. (a) Sample band of Indian Pines dataset. (b) Ground truth data. (c) Color band.

Figure 5

Salinas image. (a) Sample band of Salinas dataset. (b) Ground truth data. (c) Color band.

Figure 6

University of Pavia image. (a) Sample band of Pavia University dataset. (b) Ground truth data. (c) Color band.

Supervised learning needs a large amount of labeled data, but labeled hyperspectral data are scarce and the labeling process is complex. The experiments therefore use small training sets to cope with the shortage of labeled hyperspectral samples. The proportion of training samples in each of the three datasets does not exceed 10%. The Indian Pines dataset consists of 145 × 145 pixels and 224 spectral reflectance bands; the wavelength range is 0.4–2.5 μm, with a spatial resolution of 20 m. The division of samples is listed in Table 1.
Table 1

The Number of Samples for the Indian Pines Dataset.

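# | Class | Samples | Train | Test
1 | Alfalfa | 46 | 5 | 41
2 | Corn-notill | 1428 | 143 | 1285
3 | Corn-mintill | 830 | 83 | 747
4 | Corn | 237 | 24 | 213
5 | Grass-pasture | 483 | 48 | 435
6 | Grass-trees | 730 | 73 | 657
7 | Grass-pasture-mowed | 28 | 3 | 25
8 | Hay-windrowed | 478 | 48 | 430
9 | Oats | 20 | 2 | 18
10 | Soybean-notill | 972 | 97 | 875
11 | Soybean-mintill | 2455 | 245 | 2210
12 | Soybean-clean | 593 | 59 | 534
13 | Wheat | 205 | 20 | 185
14 | Woods | 1265 | 126 | 1139
15 | Buildings-Grass-Trees-Drives | 386 | 39 | 347
16 | Stone-Steel-Towers | 93 | 9 | 84
Total | | 10,249 | 1024 | 9225
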
The Salinas dataset was collected over Salinas Valley in California. The covered area comprises 512 lines by 217 samples, and the Salinas ground truth also contains 16 classes. The division of samples in the Salinas dataset is listed in Table 2.
Table 2

The Number of Samples for the Salinas Dataset.

# | Class | Samples | Train | Test
1 | Brocoli_green_weeds_1 | 2009 | 20 | 1989
2 | Brocoli_green_weeds_2 | 3726 | 37 | 3689
3 | Fallow | 1976 | 20 | 1956
4 | Fallow_rough_plow | 1394 | 14 | 1380
5 | Fallow_smooth | 2678 | 27 | 2651
6 | Stubble | 3959 | 39 | 3920
7 | Celery | 3579 | 36 | 3543
8 | Grapes_untrained | 11,271 | 113 | 11,158
9 | Soil_vinyard_develop | 6203 | 62 | 6141
10 | Corn_senesced_green_weeds | 3278 | 33 | 3245
11 | Lettuce_romaine_4wk | 1068 | 11 | 1057
12 | Lettuce_romaine_5wk | 1927 | 19 | 1908
13 | Lettuce_romaine_6wk | 916 | 9 | 907
14 | Lettuce_romaine_7wk | 1070 | 11 | 1059
15 | Vinyard_untrained | 7268 | 72 | 7196
16 | Vinyard_vertical_trellis | 1807 | 18 | 1789
Total | | 54,129 | 541 | 53,588

The Pavia University dataset was collected during a flight over Pavia in northern Italy. The image is 610 × 340 pixels and the geometric resolution is 1.3 m. The division of samples in the University of Pavia dataset is listed in Table 3.
Table 3

The Number of Samples for the University of Pavia Dataset.

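# | Class | Samples | Train | Test
1 | Asphalt | 6631 | 132 | 6499
2 | Meadows | 18,649 | 373 | 18,276
3 | Gravel | 2099 | 42 | 2057
4 | Trees | 3064 | 61 | 3003
5 | Painted metal sheets | 1345 | 27 | 1318
6 | Bare Soil | 5029 | 100 | 4929
7 | Bitumen | 1330 | 27 | 1303
8 | Self-Blocking Bricks | 3682 | 74 | 3608
9 | Shadows | 947 | 19 | 928
Total | | 42,776 | 855 | 41,921
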
The proposed LDFN model is implemented in Python with TensorFlow 2.0 and the Keras framework. The experiments are run on a GeForce GTX 1660 GPU with 16.00 GB of RAM. The Adam optimizer [29] is adopted, with 100 epochs, a mini-batch size of 64 and an initial learning rate of 0.001. For Indian Pines, the training and test sets are divided in a ratio of 1:9; Salinas and the University of Pavia use 1% and 2% of the samples for training, respectively (Tables 2 and 3). To unify the input, patches of adjacent pixels of the same size are fed into the model for each dataset. Figure 7 shows the OA obtained by LDFN on the three datasets with different hyperparameters.
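
A per-class (stratified) split of the kind implied by Tables 1 to 3 can be sketched as follows. The paper does not describe its sampling procedure, so the random shuffling, the rounding rule and the function name `stratified_split` are assumptions, and the per-class counts produced by this sketch may differ by one from those in the tables.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[int], train_fraction: float, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Split sample indices class by class, keeping at least one training sample per class."""
    by_class: Dict[int, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train: List[int] = []
    test: List[int] = []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Assumed rounding rule; the paper does not state how counts are rounded.
        n_train = max(1, round(train_fraction * len(indices)))
        train.extend(indices[:n_train])
        test.extend(indices[n_train:])
    return train, test


# Toy labels: class 0 has 46 samples and class 1 has 20 (the Alfalfa and Oats sizes in Table 1).
labels = [0] * 46 + [1] * 20
train_idx, test_idx = stratified_split(labels, train_fraction=0.1)
print(len(train_idx), len(test_idx))  # 7 59
```

Splitting per class rather than over the whole image keeps very small classes, such as Oats with 20 samples, represented in the training set.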
Figure 7

Overall accuracy (%) with different hyperparameters on three datasets. (a) patch sizes, (b) principal component numbers.

Figure 7a shows how OA changes with patch size when the number of principal components is fixed at 20 for all three datasets. OA first increases rapidly, then levels off and finally drops, which indicates that patch blocks that are too large or too small prevent the model from being stable and optimal. The patch size is therefore chosen separately for Indian Pines, Salinas and University of Pavia. Figure 7b shows how OA changes with the number of principal components when the patch size is fixed for all three datasets. The OA curves rise to a steady state and then drop, which means that a reasonable increase in the number of principal components helps extract rich spectral information, whereas too many principal components degrade the performance of the network. Therefore, the number of principal components is set to 25, 20 and 20 for Indian Pines, Salinas and the University of Pavia, respectively.

3.2. Quantitative Metrics and Compared Methods

Three objective metrics are used to evaluate the classification performance of the different methods: overall accuracy (OA), average accuracy (AA) and the Kappa coefficient. The proposed LDFN is compared with several other methods, which fall into two groups. The first is traditional machine learning, represented by SVM [5]. The second consists of deep learning methods: 3D-CNN [19], 3D-CAE [20], D-CNN [22] and SSRN [21]. All compared methods use the same input patch size as our LDFN model.
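
The three metrics can be computed from a confusion matrix using their standard definitions. The sketch below is an illustration only: the function name and the toy 2 × 2 matrix are hypothetical and do not come from the paper's experiments. Rows are taken as true classes and columns as predicted classes, and every class is assumed to have at least one sample.

```python
from typing import List, Tuple


def classification_metrics(confusion: List[List[int]]) -> Tuple[float, float, float]:
    """Return (OA, AA, kappa) for a matrix with rows = true class, columns = predicted class."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(n_classes)) for j in range(n_classes)]

    # OA: fraction of all samples that are classified correctly.
    overall = correct / total
    # AA: mean of the per-class accuracies (recall of each class).
    average = sum(confusion[i][i] / row_sums[i] for i in range(n_classes)) / n_classes
    # Kappa: agreement corrected for the agreement expected by chance.
    expected = sum(row_sums[i] * col_sums[i] for i in range(n_classes)) / (total * total)
    kappa = (overall - expected) / (1 - expected)
    return overall, average, kappa


toy = [[45, 5], [10, 40]]
oa, aa, kappa = classification_metrics(toy)
print(round(oa, 4), round(aa, 4), round(kappa, 4))  # 0.85 0.85 0.7
```

Because AA weights every class equally, it is more sensitive than OA to errors on small classes, which is why the two can differ noticeably on imbalanced datasets such as Indian Pines.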

3.3. Classification Results

The first experiment is carried out on the Indian Pines dataset. All methods use 10% of the samples for training and 90% for testing. Table 4 reports the quantitative results of the different methods, each averaged over 10 training runs. SVM, 3D-CNN and 3D-CAE score below 95% on all three metrics, whereas D-CNN, with its contextual deep CNN framework, and SSRN, with its several residual blocks, exceed 95% on all three. Overall, the proposed LDFN model outperforms the other methods on all three metrics. Figure 8 shows the classification maps of the different methods for the Indian Pines dataset. The SVM map is very noisy; the 3D-CNN and 3D-CAE maps are smoother but still show visible noise. SSRN and LDFN perform well and produce less noise, and LDFN is better than SSRN in terms of detail.
Table 4

Classification Results of Different Methods for the Indian Pines Dataset.

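Class | SVM [5] | 3D-CNN [19] | 3D-CAE [20] | D-CNN [22] | SSRN [21] | LDFN
1 | 67.05 | 98.00 | 90.48 | 95.24 | 97.82 | 100.00
2 | 93.77 | 96.12 | 92.49 | 97.66 | 99.16 | 99.50
3 | 67.55 | 80.49 | 90.37 | 97.72 | 97.11 | 96.02
4 | 61.20 | 92.00 | 86.90 | 97.70 | 97.51 | 99.05
5 | 93.15 | 97.00 | 94.25 | 97.63 | 99.24 | 99.54
6 | 95.70 | 96.77 | 97.07 | 99.16 | 98.57 | 99.09
7 | 84.00 | 98.02 | 91.26 | 97.20 | 98.70 | 100.00
8 | 90.52 | 98.35 | 97.79 | 99.08 | 99.70 | 100.00
9 | 75.05 | 86.30 | 75.90 | 93.33 | 98.53 | 100.00
10 | 67.70 | 90.65 | 87.34 | 97.16 | 98.27 | 97.27
11 | 87.61 | 90.17 | 90.24 | 95.53 | 97.18 | 96.90
12 | 61.21 | 92.60 | 95.76 | 96.17 | 97.12 | 97.47
13 | 92.01 | 97.00 | 97.49 | 98.53 | 99.00 | 100.00
14 | 88.77 | 97.85 | 96.03 | 98.37 | 99.17 | 99.22
15 | 88.81 | 96.43 | 90.48 | 97.06 | 99.20 | 99.12
16 | 90.71 | 97.00 | 98.82 | 93.23 | 97.82 | 97.62
OA | 80.01 | 94.10 | 92.04 | 97.93 | 98.09 | 98.54
AA | 81.55 | 94.05 | 92.35 | 96.92 | 98.38 | 98.80
Kappa | 78.33 | 93.48 | 92.21 | 95.17 | 97.01 | 98.34
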
Figure 8

Classification maps for the Indian Pines dataset. (a) SVM:80.01%. (b) 3D-CNN:94.10%. (c) 3D-CAE:92.04%. (d) D-CNN:97.93%. (e) SSRN:98.09%. (f) LDFN:98.54%.

The second experiment is based on the Salinas dataset. All methods use 1% of the samples for training and 99% for testing. Table 5 reports the quantitative results of the different methods, each averaged over 10 training runs. The traditional SVM method reaches only about 81% to 86% on the three metrics, whereas the deep learning methods all reach about 95% or more: 3D-CNN, 3D-CAE and D-CNN score roughly 95% to 97%. LDFN achieves an OA of 99.36%, an AA of 99.56% and a Kappa coefficient of 99.29%. Figure 9 shows the classification maps for the Salinas dataset; the LDFN map is clearly smoother than those of the compared methods. The LDFN model therefore performs better.
Table 5

Classification Results of Different Methods for the Salinas Dataset.

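Class | SVM [5] | 3D-CNN [19] | 3D-CAE [20] | D-CNN [22] | SSRN [21] | LDFN
1 | 80.00 | 97.54 | 99.00 | 97.20 | 99.23 | 100.00
2 | 87.94 | 98.89 | 98.29 | 96.92 | 99.94 | 100.00
3 | 89.72 | 97.42 | 96.13 | 83.62 | 99.95 | 100.00
4 | 82.55 | 98.10 | 97.34 | 96.28 | 97.49 | 98.22
5 | 77.87 | 97.98 | 97.35 | 94.76 | 96.70 | 100.00
6 | 88.67 | 97.97 | 97.90 | 95.07 | 99.15 | 99.90
7 | 89.86 | 98.71 | 97.64 | 97.12 | 99.62 | 100.00
8 | 81.33 | 89.67 | 91.58 | 90.84 | 98.16 | 98.53
9 | 90.02 | 98.99 | 98.93 | 97.07 | 99.96 | 99.55
10 | 86.57 | 96.27 | 95.98 | 96.43 | 99.43 | 99.81
11 | 90.00 | 98.48 | 98.37 | 95.87 | 97.16 | 100.00
12 | 84.06 | 98.76 | 98.84 | 95.64 | 98.53 | 99.95
13 | 58.19 | 95.88 | 98.56 | 96.24 | 95.81 | 99.66
14 | 57.49 | 98.94 | 97.52 | 95.10 | 98.53 | 98.69
15 | 69.81 | 86.18 | 88.85 | 96.03 | 99.08 | 98.69
16 | 89.56 | 98.70 | 97.34 | 95.11 | 99.35 | 100.00
OA | 85.97 | 95.24 | 96.05 | 95.35 | 98.38 | 99.36
AA | 81.48 | 96.78 | 96.85 | 94.96 | 98.63 | 99.56
Kappa | 83.93 | 94.66 | 95.51 | 95.46 | 98.36 | 99.29
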
Figure 9

Classification maps for the Salinas dataset. (a) SVM: 85.97% (b) 3D-CNN: 95.24% (c) 3D-CAE: 96.05% (d) D-CNN: 95.35%. (e) SSRN: 98.38%. (f) LDFN: 99.36%.

The third experiment is carried out on the Pavia University dataset, with 2% of the samples used to train all methods and 98% used for testing. Table 6 reports the quantitative results of the different methods, each averaged over 10 training runs. The LDFN model achieves an OA of 99.19%, an AA of 98.89% and a Kappa of 98.92%, better than the 98.57% OA, 97.16% AA and 98.27% Kappa of SSRN and better than the other deep learning methods, whose scores lie between about 93% and 97%. Because it does not exploit spatial features, the traditional SVM method reaches only about 89% on the three metrics. Figure 10 shows the classification maps of the different methods for the University of Pavia dataset, in which the LDFN map is visibly better, especially in edge details and local regions.
Table 6

Classification Results of Different Methods for the University of Pavia Dataset.

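Class | SVM [5] | 3D-CNN [19] | 3D-CAE [20] | D-CNN [22] | SSRN [21] | LDFN
1 | 90.36 | 93.27 | 95.21 | 96.11 | 98.80 | 99.17
2 | 97.25 | 97.61 | 96.06 | 98.91 | 99.69 | 99.95
3 | 70.93 | 90.01 | 91.32 | 90.82 | 95.15 | 94.64
4 | 90.93 | 94.17 | 98.28 | 92.63 | 95.02 | 99.53
5 | 96.46 | 98.02 | 95.55 | 97.63 | 99.14 | 100.00
6 | 81.76 | 90.03 | 95.30 | 99.14 | 99.69 | 99.92
7 | 83.59 | 80.21 | 95.14 | 93.12 | 96.68 | 99.85
8 | 88.14 | 95.97 | 91.38 | 97.77 | 98.74 | 97.24
9 | 96.97 | 99.63 | 99.96 | 89.43 | 91.54 | 99.78
OA | 89.18 | 94.33 | 95.36 | 97.19 | 98.57 | 99.19
AA | 88.48 | 93.21 | 95.35 | 95.06 | 97.16 | 98.89
Kappa | 88.63 | 93.07 | 95.12 | 96.29 | 98.27 | 98.92
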
Figure 10

Classification maps for the University of Pavia dataset. (a) SVM: 89.18% (b) 3D-CNN: 94.33% (c) 3D-CAE: 95.36% (d) D-CNN: 97.19% (e) SSRN: 98.57%. (f) LDFN: 99.19%.

3.4. Comparison of Different Local and HDC Fusion Strategies

In this section, different structures of local and HDC fusion models are compared to demonstrate the effectiveness of the LDFN model. Since the local spatial-spectral content is extracted by local convolution, the dilation rates of the HDC start at 2 instead of 1. Table 7 lists the OA values obtained by the local and HDC fusion models. LDFN24 stacks dilation rates 2 and 4, LDFN25 stacks rates 2 and 5, and LDFN34 stacks rates 3 and 4. LDFN234 stacks three dilation rates: 2, 3 and 4. Table 7 shows that the models with the local and HDC structure achieve better HSI classification results than the compared fusion model D-CNN in all cases but one (LDFN34 on Indian Pines, 97.92% versus 97.93%). Moreover, the proposed LDFN, with dilation rates of 2, 3 and 5, outperforms the variants with other dilation rates on all three datasets.
Table 7

OA Values Obtained by Local and HDC Fusion Model on Three Datasets.

Dataset | Metric | D-CNN | LDFN24 | LDFN25 | LDFN34 | LDFN234 | LDFN
Indian Pines | OA | 97.93 | 98.09 | 98.01 | 97.92 | 98.25 | 98.54
Salinas | OA | 95.35 | 99.01 | 98.47 | 98.12 | 99.11 | 99.36
University of Pavia | OA | 97.19 | 98.59 | 98.30 | 97.79 | 99.07 | 99.19

In general, the extensive experiments on three HSI datasets show that our LDFN model is stable and easy to train, and that it is effective compared with existing methods.

4. Conclusions

In this paper, a novel deep learning method called the local and hybrid dilated convolution fusion network (LDFN) was proposed for HSI classification. The network fuses local convolution and hybrid dilated convolution. Local convolution connects neighboring pixels closely, which makes the convolution layer more flexible and expressive. Hybrid dilated convolution, stacked with dilation rates of 2, 3 and 5, enlarges the receptive field and takes the spatial correlation of adjacent areas in hyperspectral images into account without increasing the amount of computation. The model also uses a residual fusion network to integrate the preceding HDC and standard convolution into the main channels, which both addresses the problem of an insufficient receptive field and extracts multi-scale feature information. Experimental results demonstrate that the LDFN model achieves satisfactory classification accuracy for hyperspectral images while remaining lightweight. The model still has considerable room for improvement: at present it contains redundant parameters and needs some training time to extract spectral-spatial features. Future research will pay more attention to multi-scale information fusion and to reducing the number of model parameters, which may help optimize the LDFN model and better integrate spectral and spatial features for HSI classification.
References

1.  Evaluation of convolutional neural networks for visual recognition.

Authors:  C Nebauer
Journal:  IEEE Trans Neural Netw       Date:  1998

2.  Going Deeper With Contextual CNN for Hyperspectral Image Classification.

Authors:  Hyungtae Lee; Heesung Kwon
Journal:  IEEE Trans Image Process       Date:  2017-07-11       Impact factor: 10.856
