Donghang Yu, Qing Xu, Haitao Guo, Chuan Zhao, Yuzhun Lin, Daoji Li.
Abstract
Classifying remote sensing images is vital for interpreting image content. Current remote sensing scene classification methods based on convolutional neural networks (CNNs) suffer from excessive parameters and heavy computational cost, while more efficient, lightweight CNNs reduce both but generally classify less accurately. We propose an efficient and lightweight CNN method that improves classification accuracy with a small training dataset. Inspired by fine-grained visual recognition, this study introduces a bilinear convolutional neural network model for scene classification. First, the lightweight network MobileNetv2 extracts deep, abstract image features. Each feature is then transformed into two features by two different convolutional layers, and the transformed features are combined by a Hadamard product to obtain an enhanced bilinear feature. Finally, the bilinear feature, after pooling and normalization, is used for classification. Experiments are performed on three widely used datasets: UC Merced, AID, and NWPU-RESISC45. Compared with other state-of-the-art methods, the proposed method achieves higher accuracy with fewer parameters and calculations. Feature fusion with bilinear pooling thus substantially improves remote sensing scene classification performance and could be applied to any remote sensing image classification task.
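As a concrete illustration of the pipeline described above, here is a minimal PyTorch sketch. It is not the authors' code: the use of torchvision's MobileNetv2 backbone, the default 1 × 1 kernel for the two transform convolutions, and the signed-square-root step (standard practice in bilinear-CNN normalization) are assumptions; the kernel size k and the 1024-dimensional output follow the BiMobileNet details table below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import mobilenet_v2

class BiMobileNetSketch(nn.Module):
    """Hadamard-product bilinear pooling on MobileNetv2 features (illustrative)."""
    def __init__(self, num_classes, k=1, dim=1024):
        super().__init__()
        # Keep MobileNetv2 feature layers up to the 7 x 7 x 320 stage.
        self.backbone = mobilenet_v2(weights="IMAGENET1K_V1").features[:-1]
        # Two different convolutions transform each feature into two features.
        self.conv_a = nn.Conv2d(320, dim, kernel_size=k, padding=k // 2)
        self.conv_b = nn.Conv2d(320, dim, kernel_size=k, padding=k // 2)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, x):
        f = self.backbone(x)                            # B x 320 x 7 x 7
        z = self.conv_a(f) * self.conv_b(f)             # Hadamard product
        z = F.adaptive_avg_pool2d(z, 1).flatten(1)      # average pooling -> B x dim
        z = torch.sign(z) * torch.sqrt(z.abs() + 1e-8)  # signed sqrt (assumed)
        z = F.normalize(z)                              # L2 normalization
        return self.fc(z)

logits = BiMobileNetSketch(num_classes=45)(torch.randn(2, 3, 224, 224))
```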
Keywords: MobileNet; bilinear model; convolutional neural network; remote sensing image; scene classification
Year: 2020 PMID: 32252483 PMCID: PMC7181261 DOI: 10.3390/s20071999
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. Class representatives of the UC Merced dataset: (1) agricultural; (2) airplane; (3) baseball diamond; (4) beach; (5) buildings; (6) chaparral; (7) dense residential; (8) forest; (9) freeway; (10) golf course; (11) harbor; (12) intersection; (13) medium residential; (14) mobile home park; (15) overpass; (16) parking lot; (17) river; (18) runway; (19) sparse residential; (20) storage tanks; and (21) tennis court.
Figure 2. Class representatives of the AID dataset: (1) airport; (2) bareland; (3) baseball field; (4) beach; (5) bridge; (6) center; (7) church; (8) commercial; (9) dense residential; (10) desert; (11) farmland; (12) forest; (13) industrial; (14) meadow; (15) medium residential; (16) mountain; (17) park; (18) parking; (19) playground; (20) pond; (21) port; (22) railway station; (23) resort; (24) river; (25) school; (26) sparse residential; (27) square; (28) stadium; (29) storage tanks; and (30) viaduct.
Figure 3. Class representatives of the NWPU-RESISC45 dataset: (1) airplane; (2) airport; (3) baseball diamond; (4) basketball court; (5) beach; (6) bridge; (7) chaparral; (8) church; (9) circular farmland; (10) cloud; (11) commercial area; (12) dense residential; (13) desert; (14) forest; (15) freeway; (16) golf course; (17) ground track field; (18) harbor; (19) industrial area; (20) intersection; (21) island; (22) lake; (23) meadow; (24) medium residential; (25) mobile home park; (26) mountain; (27) overpass; (28) palace; (29) parking lot; (30) railway; (31) railway station; (32) rectangular farmland; (33) river; (34) roundabout; (35) runway; (36) sea ice; (37) ship; (38) snowberg; (39) sparse residential; (40) stadium; (41) storage tanks; (42) tennis court; (43) terrace; (44) thermal power station; and (45) wetland.
Dataset information.
| Dataset Name | Number of Classes | Image Size | Resolution/m | Images per Class | Total Images |
|---|---|---|---|---|---|
| UC Merced | 21 | 256 × 256 | 2 | 100 | 2100 |
| AID | 30 | 600 × 600 | 0.5~8 | 200~400 | 10,000 |
| NWPU-RESISC45 | 45 | 256 × 256 | 0.2~30 | 700 | 31,500 |
Figure 4. The difference between (a) standard convolution and (b) depthwise separable convolution. DK is the width and height of the convolution kernels, DF is the width and height of the output feature maps, M is the number of channels of the input feature maps, and N is the number of channels of the output feature maps.
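To make the saving concrete, here is a hedged PyTorch comparison of the two operations in Figure 4 (the values M = 32, N = 64, DK = 3 are illustrative, not taken from the paper): a standard convolution holds DK·DK·M·N weights, while the depthwise separable version holds only DK·DK·M + M·N.

```python
import torch
import torch.nn as nn

M, N, DK = 32, 64, 3  # illustrative channel counts and kernel size

# (a) Standard convolution: DK*DK*M*N = 18,432 weights.
standard = nn.Conv2d(M, N, kernel_size=DK, padding=1, bias=False)

# (b) Depthwise separable convolution: a per-channel DK x DK depthwise
# convolution (groups=M) followed by a 1 x 1 pointwise convolution,
# i.e. DK*DK*M + M*N = 2,336 weights.
separable = nn.Sequential(
    nn.Conv2d(M, M, kernel_size=DK, padding=1, groups=M, bias=False),
    nn.Conv2d(M, N, kernel_size=1, bias=False),
)

x = torch.randn(1, M, 56, 56)
assert standard(x).shape == separable(x).shape  # identical output size

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(separable))  # 18432 2336, roughly 7.9x fewer
```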
Figure 5. The basic structure of MobileNet: (a) unit in MobileNetv1 [66]; (b) unit in MobileNetv2 [65].
Figure 6. Blocks in a convolutional neural network (CNN): (a) standard residual block; (b) inverted residual block in MobileNetv2 when stride is 1; (c) linear block in MobileNetv2 when stride is 2.
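A sketch of the MobileNetv2 blocks from Figure 6, under common assumptions (expansion factor t = 6, ReLU6 activations; batch normalization is omitted for brevity). The skip connection is used only when the stride is 1 and the input and output channels match, which distinguishes the inverted residual block (b) from the linear block (c):

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    """MobileNetv2 block: 1x1 expand -> 3x3 depthwise -> 1x1 linear projection."""
    def __init__(self, c_in, c_out, stride=1, t=6):
        super().__init__()
        hidden = c_in * t
        self.use_residual = stride == 1 and c_in == c_out  # (b) vs. (c)
        self.block = nn.Sequential(
            nn.Conv2d(c_in, hidden, 1, bias=False), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, c_out, 1, bias=False),  # linear: no activation
        )

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_residual else y
```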
Figure 7. Pipeline of the bilinear-CNN model proposed by Lin et al. [63].
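For contrast with the Hadamard-product variant used in this paper, the original bilinear pooling of Lin et al. takes the outer product of two feature maps at every spatial location and sum-pools the result (the channel sizes below are illustrative):

```python
import torch
import torch.nn.functional as F

B, C1, C2, H, W = 2, 320, 320, 7, 7
fa = torch.randn(B, C1, H, W)  # features from stream A
fb = torch.randn(B, C2, H, W)  # features from stream B

# Outer product at each spatial location, pooled over H*W -> B x C1 x C2.
bilinear = torch.einsum("bchw,bdhw->bcd", fa, fb) / (H * W)
z = bilinear.flatten(1)                         # B x (C1*C2) descriptor
z = torch.sign(z) * torch.sqrt(z.abs() + 1e-8)  # signed square root
z = F.normalize(z)                              # L2 normalization
```

The resulting descriptor is large (320 × 320 = 102,400 dimensions here), which is the cost the Hadamard-product formulation avoids.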
Figure 8. Architecture of BiMobileNet. See text for description.
BiMobileNet details (k is the kernel size of the two bilinear pooling convolutions).
| Layer Name | Operation | Input Size | Output Size |
|---|---|---|---|
| Conv2d | Conv2d, kernel size = (3 × 3), stride = 2 | 224 × 224 × 3 | 112 × 112 × 32 |
| Bottleneck-1 | Linear block | 112 × 112 × 32 | 112 × 112 × 16 |
| Bottleneck-2 | Linear block | 112 × 112 × 16 | 56 × 56 × 24 |
| | Inverted residual block | 56 × 56 × 24 | 56 × 56 × 24 |
| Bottleneck-3 | Linear block | 56 × 56 × 24 | 28 × 28 × 32 |
| | Inverted residual block | 28 × 28 × 32 | 28 × 28 × 32 |
| | Inverted residual block | 28 × 28 × 32 | 28 × 28 × 32 |
| Bottleneck-4 | Linear block | 28 × 28 × 32 | 28 × 28 × 64 |
| | Inverted residual block | 28 × 28 × 64 | 28 × 28 × 64 |
| | Inverted residual block | 28 × 28 × 64 | 28 × 28 × 64 |
| | Inverted residual block | 28 × 28 × 64 | 28 × 28 × 64 |
| Bottleneck-5 | Linear block | 28 × 28 × 64 | 14 × 14 × 96 |
| | Inverted residual block | 14 × 14 × 96 | 14 × 14 × 96 |
| | Inverted residual block | 14 × 14 × 96 | 14 × 14 × 96 |
| Bottleneck-6 | Linear block | 14 × 14 × 96 | 7 × 7 × 160 |
| | Inverted residual block | 7 × 7 × 160 | 7 × 7 × 160 |
| | Inverted residual block | 7 × 7 × 160 | 7 × 7 × 160 |
| Bottleneck-7 | Linear block | 7 × 7 × 160 | 7 × 7 × 320 |
| Bilinear Pooling | Conv2d-1, kernel size = (k × k) | 7 × 7 × 320 | 7 × 7 × 1024 |
| | Conv2d-2, kernel size = (k × k) | 7 × 7 × 320 | 7 × 7 × 1024 |
| | AvgPooling, kernel size = (7 × 7) | 7 × 7 × 1024 | 1 × 1 × 1024 |
| Classification | Fully Connected | 1 × 1 × 1024 | class number |
Overall accuracy (%) of state-of-the-art methods on the UC Merced dataset under different training ratios. The highest accuracy for each ratio is bolded.
| Method | Published Year | 20% | 50% | 80% |
|---|---|---|---|---|
| BOVW [ ] | 2010 | | | 76.81 |
| VLAT [ ] | 2014 | | | 94.30 |
| MS-CLBP+FV [ ] | 2016 | | 88.76 ± 0.79 | 93.00 ± 1.20 |
| TEX-NET-FL (ResNet) [ ] | 2017 | | 96.91 ± 0.36 | 97.72 ± 0.54 |
| salM3LBP-CLM [ ] | 2017 | | 91.21 ± 0.75 | 95.75 ± 0.80 |
| VGG-VD-16 [ ] | 2017 | | 94.14 ± 0.69 | 95.21 ± 1.20 |
| CNN-ELM [ ] | 2017 | | | 95.62 ± 0.32 |
| Two-Stream Fusion [ ] | 2018 | | | 98.02 ± 1.03 |
| D-CNN (VGG16) [ ] | 2018 | | | 98.93 ± 0.10 |
| RTN (VGG16) [ ] | 2018 | | | 98.96 |
| DCF (VGG-VD16) [ ] | 2018 | | 95.42 ± 0.71 | 97.10 ± 0.85 |
| GCFs+LOFs (VGG16) [ ] | 2018 | | 97.37 ± 0.44 | 99.00 ± 0.35 |
| SAL-TS-Net (GoogLeNet) [ ] | 2018 | | 97.97 ± 0.56 | 98.90 ± 0.95 |
| Siamese ResNet50 [ ] | 2019 | 76.50 | 90.95 | 94.29 |
| SF-CNN (VGGNet) [ ] | 2019 | | | 99.05 ± 0.27 |
| VGG16-DF [52] | 2019 | | | 98.97 |
| MRBF [ ] | 2019 | | | 94.19 ± 0.15 |
| DDRL-AM (ResNet18) [ ] | 2019 | | | |
| WSPM-CRC (ResNet152) [ ] | 2019 | | | 97.95 |
| CTFCNN [ ] | 2019 | | | 98.44 ± 0.58 |
| CapsNet (Inception-v3) [ ] | 2019 | | 97.59 ± 0.16 | 99.05 ± 0.24 |
| BiMobileNet (MobileNetv2) | 2020 | | | 99.03 ± 0.28 |
Overall accuracy (%) of BiMobileNet under different training ratios. The highest accuracy for each ratio is bolded.
| Method | 5% | 10% | 15% | 20% | 25% |
|---|---|---|---|---|---|
| Fine-tuning VGG16 | 39.53 ± 2.23 | 53.12 ± 1.15 | 59.83 ± 2.45 | 64.68 ± 2.70 | 69.51 ± 0.65 |
| Fine-tuning ResNet50 | 39.01 ± 1.62 | 51.35 ± 1.25 | 57.40 ± 0.96 | 64.82 ± 0.64 | 71.06 ± 0.72 |
| Fine-tuning MobileNetv2 | 38.64 ± 1.45 | 52.85 ± 0.85 | 60.90 ± 1.26 | 67.86 ± 1.12 | 72.48 ± 0.40 |
| BiMobileNet | | | | | |
Figure 9. Confusion matrix using a training ratio of 5% on the UC Merced dataset.
Figure 10. Confusion matrix using a training ratio of 10% on the UC Merced dataset.
Figure 11. Confusion matrix using a training ratio of 20% on the UC Merced dataset.
Figure 12. Confusion matrix using a training ratio of 50% on the UC Merced dataset.
Overall accuracy (%) of state-of-the-art methods on the AID dataset under different training ratios. The highest accuracy for each ratio is bolded.
| Method | Published Year | 10% | 20% | 50% |
|---|---|---|---|---|
| TEX-Net-LF (ResNet) [ ] | 2017 | | 93.81 ± 0.12 | 95.73 ± 0.16 |
| salM3LBP-CLM [ ] | 2017 | | 86.92 ± 0.35 | 89.76 ± 0.45 |
| VGG-VD-16 [ ] | 2017 | | 86.59 ± 0.29 | 89.64 ± 0.36 |
| DCA (VGGNet) [ ] | 2017 | | 91.86 ± 0.28 | |
| RTN (VGG16) [ ] | 2018 | | 92.44 | |
| D-CNN (VGG16) [ ] | 2018 | | 90.82 ± 0.16 | |
| GCFs+LOFs (VGG16) [ ] | 2018 | | 92.48 ± 0.38 | 96.85 ± 0.23 |
| SAL-TS-Net (GoogLeNet) [ ] | 2018 | | 94.09 ± 0.34 | 95.99 ± 0.35 |
| MRBF [ ] | 2019 | | 87.26 ± 0.42 | |
| SF-CNN (VGGNet) [ ] | 2019 | | 93.60 ± 0.12 | 96.66 ± 0.11 |
| CTFCNN [ ] | 2019 | | 94.91 ± 0.24 | |
| WSPM-CRC (ResNet152) [ ] | 2019 | | | 95.11 |
| BiMobileNet (MobileNetv2) | 2020 | | | 96.87 ± 0.23 |
Figure 13. Confusion matrix using a training ratio of 10% on the AID dataset.
Figure 14. Confusion matrix using a training ratio of 20% on the AID dataset.
Figure 15. Confusion matrix using a training ratio of 50% on the AID dataset.
Overall accuracy (%) of state-of-the-art methods on the NWPU-RESISC45 dataset under different training ratios. The highest accuracy for each ratio is bolded.
| Method | Published Year | 10% | 20% |
|---|---|---|---|
| Fine-tuned VGG16 [ ] | 2017 | 87.15 ± 0.45 | 90.36 ± 0.18 |
| D-CNN (VGG16) [ ] | 2018 | 89.22 ± 0.50 | 91.89 ± 0.22 |
| IOR4 (VGG16) [ ] | 2018 | 87.83 ± 0.16 | 91.30 ± 0.17 |
| RTN (VGG16) [ ] | 2018 | 89.90 | 92.71 |
| DCF (VGG-VD16) [ ] | 2018 | 87.14 ± 0.19 | 89.56 ± 0.25 |
| DDRL-AM (ResNet18) [ ] | 2018 | | 92.46 ± 0.09 |
| SAL-TS-Net (GoogLeNet) [ ] | 2018 | 85.02 ± 0.20 | 87.01 ± 0.19 |
| Triplet Network [ ] | 2018 | | 92.33 ± 0.50 |
| VGG16-DF [ ] | 2019 | 89.66 | |
| Siamese ResNet50 [ ] | 2019 | | 92.28 |
| SF-CNN (VGG16) [ ] | 2019 | 89.89 ± 0.16 | 92.55 ± 0.14 |
| GLANet [ ] | 2019 | 91.03 ± 0.18 | 93.45 ± 0.17 |
| CapsNet (Inception-v3) [ ] | 2019 | 89.03 ± 0.21 | 92.60 ± 0.11 |
| DML (VGG16) [ ] | 2019 | 91.73 ± 0.21 | 93.47 ± 0.30 |
| BiMobileNet (MobileNetv2) | 2020 | 92.06 ± 0.14 | |
Figure 16. Confusion matrix using a training ratio of 10% on the NWPU-RESISC45 dataset.
Figure 17. Confusion matrix using a training ratio of 20% on the NWPU-RESISC45 dataset.
Accuracy and complexity of state-of-the-art methods on the NWPU-RESISC45 dataset. The highest accuracy and the lowest computational cost are bolded.
| Method | Width Multiplier | Overall Accuracy (10%) | Overall Accuracy (20%) | Parameters (Million) | GFLOPs ¹ | Model Size (MB) |
|---|---|---|---|---|---|---|
| Fine-tuned VGG16 [ ] | / | 87.15 ± 0.45 | 90.36 ± 0.18 | ~134.44 | ~15.60 | ~512.87 |
| SF-CNN (VGG16) [ ] | / | 89.89 ± 0.16 | 92.55 ± 0.14 | ~17.28 | ~15.49 | ~65.93 |
| DML (VGG16) [ ] | / | 91.73 ± 0.21 | 93.47 ± 0.30 | ~134.44 | ~15.60 | ~512.87 |
| SAL-TS-Net (GoogLeNet) [ ] | / | 85.02 ± 0.20 | 87.01 ± 0.19 | ~10.07 | ~1.51 | ~38.41 |
| BiMobileNet (k = 3) | 0.50 | 90.26 ± 0.23 | 92.77 ± 0.14 | 3.47 | 0.17 | 13.27 |
| | 0.75 | 91.47 ± 0.16 | 93.64 ± 0.12 | 5.52 | 0.33 | 21.05 |
| | 1.00 | | | 7.76 | 0.45 | 29.59 |
| BiMobileNet (k = 1) | 0.50 | 90.06 ± 0.11 | 92.74 ± 0.11 | | | |
| | 0.75 | 91.23 ± 0.09 | 93.67 ± 0.05 | 1.59 | 0.24 | 6.05 |
| | 1.00 | 91.89 ± 0.19 | 93.92 ± 0.11 | 2.52 | 0.34 | 9.59 |
¹ GFLOPs = 10⁹ floating-point operations; k is the kernel size in the bilinear pooling layer.
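The BiMobileNet parameter counts in the table above can be approximated with simple arithmetic. In this back-of-the-envelope sketch, the ~1.81M backbone estimate (MobileNetv2 feature layers up to the 320-channel stage at width multiplier 1.0) and the attribution of the two row groups to k = 1 and k = 3 are assumptions inferred from the table, not figures stated by the authors:

```python
# Approximate BiMobileNet parameter count on NWPU-RESISC45 (45 classes)
# at width multiplier 1.0.
DIM, CLASSES, C_OUT = 1024, 45, 320
BACKBONE = 1.81e6  # assumed MobileNetv2 feature extractor size up to 7x7x320

def bimobilenet_params(k):
    bilinear = 2 * C_OUT * DIM * k * k  # two k x k transform convolutions
    fc = DIM * CLASSES + CLASSES        # final fully connected layer
    return BACKBONE + bilinear + fc

for k in (1, 3):
    print(f"k = {k}: ~{bimobilenet_params(k) / 1e6:.2f}M")
# k = 1: ~2.51M (table: 2.52M); k = 3: ~7.75M (table: 7.76M)
```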