Wei Wang, Xiaowei Wen, Xin Wang, Chen Tang, Jiwei Deng.
Abstract
Remote-sensing scene datasets contain large numbers of scene images at widely varying scales. Traditional scene classification algorithms based on convolutional neural networks struggle to extract the complex spatial-distribution and texture information in such images, which leads to poor classification results. To address this problem, we introduce the vision transformer, a network structure with strong global modeling ability, into the remote-sensing image scene classification task. In this paper, a parallel network structure combining a local-window self-attention mechanism with an equivalent large convolution kernel realizes spatial-channel modeling, so that the network achieves better local and global feature extraction. Experiments on the RSSCN7 and WHU-RS19 datasets show that the proposed network improves scene classification accuracy. The effectiveness of the network structure for remote-sensing image classification is further verified through ablation experiments, confusion matrices, and heat-map comparisons.
Year: 2022 PMID: 36268144 PMCID: PMC9578839 DOI: 10.1155/2022/2661231
Source DB: PubMed Journal: Comput Intell Neurosci
Figure 1. (a) and (b) Examples of object size and number variation in remote-sensing images.
Figure 2. Network structure diagram.
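The parallel spatial-channel structure can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the single-head attention with identity projections, the uniform averaging kernel standing in for the learned equivalent large-kernel convolution, and the sizes `win`, `k`, and the even channel split are all simplifications introduced here. Only the topology follows the paper's description: split the channels, run a local-window self-attention branch and a large-kernel convolution branch in parallel, then mix the branch outputs with a channel shuffle.

```python
import numpy as np

def channel_shuffle(x, groups):
    """Interleave channels across groups so the two branch outputs mix.
    x: feature map of shape (C, H, W)."""
    c, h, w = x.shape
    x = x.reshape(groups, c // groups, h, w)
    x = x.transpose(1, 0, 2, 3)
    return x.reshape(c, h, w)

def window_attention(x, win=4):
    """Toy local-window self-attention: softmax similarity inside each
    non-overlapping win x win window (single head, identity projections)."""
    c, h, w = x.shape
    out = np.empty_like(x)
    for i in range(0, h, win):
        for j in range(0, w, win):
            patch = x[:, i:i + win, j:j + win].reshape(c, -1)  # (C, win*win)
            attn = patch.T @ patch / np.sqrt(c)                # pairwise scores
            attn = np.exp(attn - attn.max(axis=-1, keepdims=True))
            attn /= attn.sum(axis=-1, keepdims=True)           # row-wise softmax
            out[:, i:i + win, j:j + win] = (patch @ attn.T).reshape(c, win, win)
    return out

def large_kernel_conv(x, k=7):
    """Toy depthwise large-kernel convolution (uniform averaging weights),
    standing in for the equivalent large convolution kernel branch."""
    c, h, w = x.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.empty_like(x)
    for i in range(h):
        for j in range(w):
            out[:, i, j] = xp[:, i:i + k, j:j + k].mean(axis=(1, 2))
    return out

def parallel_block(x, win=4, k=7):
    """Split channels between the two branches, then shuffle to fuse them."""
    c = x.shape[0]
    a, b = x[:c // 2], x[c // 2:]
    y = np.concatenate([window_attention(a, win), large_kernel_conv(b, k)])
    return channel_shuffle(y, groups=2)

x = np.random.randn(8, 8, 8)
y = parallel_block(x)
print(y.shape)  # (8, 8, 8)
```

The channel shuffle is what lets the attention branch and the convolution branch exchange information without an extra fusion layer; the paper's ablation rows "No shuffle" and "Point convolution" compare exactly this design choice against its alternatives.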
Comparison of model parameters and computational cost.
| Model | FLOPs (M) | Parameters (M) |
|---|---|---|
| ResNet50 | 5367.48 | 23.52 |
| Vgg16 | 20185.96 | 138.36 |
| DenseNet | 3742.61 | 7.98 |
| GoogleNet | 2071.13 | 6.99 |
| ViT-Ti | 21980.16 | 86.38 |
| Swin transformer | 7078.50 | 28.24 |
| VAN | 1149.08 | 41.00 |
| Conformer-Ti | 3241.03 | 11.31 |
| CMT | 1580.74 | 8.17 |
| MPViT | 4212.25 | 6.08 |
| CAW | 566.61 | 1.27 |
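The parameter gaps in the table follow from the standard convolution parameter formula. As a hedged aside, the toy calculation below (with illustrative channel and kernel sizes chosen here, not taken from the paper) shows why replacing a full large-kernel convolution with a depthwise-plus-pointwise equivalent, as large-kernel attention designs do, cuts parameters by more than an order of magnitude:

```python
def conv_params(c_in, c_out, k, groups=1, bias=True):
    """Learnable parameters of a 2-D convolution layer: weights plus optional bias."""
    return (c_in // groups) * k * k * c_out + (c_out if bias else 0)

# Illustrative sizes (not from the paper): 64 -> 64 channels, 7x7 kernel.
full = conv_params(64, 64, 7)                                           # standard conv
separable = conv_params(64, 64, 7, groups=64) + conv_params(64, 64, 1)  # depthwise + 1x1
print(full, separable)  # 200768 7360
```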
Experimental platform data.
| Attributes | Configuration information |
|---|---|
| Operating system | Windows 10 |
| CPU | Intel(R) Core (TM) i5-10300H CPU @ 2.50 GHz |
| GPU | GeForce RTX 2060 |
| CUDA | CUDA 11.6.110 |
| Framework | PyTorch 3.7 |
Comparison results of CAW-Net networks with different depths.
| Model | Accuracy (%) | Precision (%) | Recall (%) | Specificity (%) | F1-score (%) |
|---|---|---|---|---|---|
| CAW (1, 2, 1) | 95.18 | 95.24 | 95.17 | 99.21 | 95.19 |
| CAW (1, 2, 2) | 95.54 | 95.39 | 95.54 | 99.29 | 95.53 |
| CAW (1, 2, 3) | 95.35 | 95.40 | 95.34 | 99.24 | 95.37 |
| CAW (1, 1, 1) | | | | | |
Comparison of parameters and computational cost of CAW-Net at different depths.
| Model | FLOPs (M) | Parameters (M) |
|---|---|---|
| CAW (1, 2, 1) | 656.68 | 1.58 |
| CAW (1, 2, 2) | 695.12 | 2.03 |
| CAW (1, 2, 3) | 733.55 | 2.48 |
| CAW (1, 1, 1) | 566.61 | 1.27 |
Comparison results of parallel networks with different structures.
| Model | Accuracy (%) | Precision (%) | Recall (%) | Specificity (%) | F1-score (%) |
|---|---|---|---|---|---|
| Swin transformer-only | 95.71 | 95.84 | 95.73 | 99.30 | 95.71 |
| VAN-only | 94.69 | 94.69 | 94.64 | 99.14 | 94.63 |
| No shuffle | 95.36 | 95.39 | 95.34 | 99.24 | 95.36 |
| Point convolution | 95.53 | 95.64 | 95.53 | 99.27 | 95.56 |
| CAW | | | | | |
Figure 3. RSSCN7 dataset classification confusion matrix.
Figure 4. WHU-RS19 dataset classification confusion matrix.
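The tables report accuracy, precision, recall, specificity, and F1-score; for a multi-class confusion matrix like those in Figures 3 and 4, these are computed per class and macro-averaged. A small self-contained sketch of that computation follows (the 3-class matrix is made-up illustrative data, not taken from the paper):

```python
def metrics_from_confusion(cm):
    """Accuracy plus macro-averaged precision, recall, specificity, and
    F1-score from a square confusion matrix cm[true][pred]."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n))
    prec, rec, spec, f1 = [], [], [], []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, truly another class
        fn = sum(cm[k]) - tp                       # truly k, predicted another class
        tn = total - tp - fp - fn
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        prec.append(p)
        rec.append(r)
        spec.append(tn / (tn + fp) if tn + fp else 0.0)
        f1.append(2 * p * r / (p + r) if p + r else 0.0)
    mean = lambda v: sum(v) / n
    return correct / total, mean(prec), mean(rec), mean(spec), mean(f1)

# Toy 3-class confusion matrix (rows: true class, columns: predicted class).
cm = [[8, 1, 1],
      [0, 9, 1],
      [1, 0, 9]]
acc, p, r, s, f = metrics_from_confusion(cm)
print(round(acc, 4))  # 0.8667
```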
Overall accuracy and other metrics of each method on the RSSCN7 dataset.
| Model | Accuracy (%) | Precision (%) | Recall (%) | Specificity (%) | F1-score (%) |
|---|---|---|---|---|---|
| ResNet50 | 94.46 | 94.59 | 94.09 | 99.09 | 94.49 |
| Vgg16 | 93.75 | 93.79 | 93.76 | 98.99 | 93.71 |
| GoogleNet | 93.57 | 93.61 | 93.57 | 98.93 | 93.56 |
| DenseNet | 93.21 | 93.34 | 93.21 | 98.89 | 93.21 |
| ViT-Ti | 90.89 | 90.89 | 90.89 | 98.49 | 90.89 |
| Swin transformer | 93.93 | 93.96 | 93.91 | 99.00 | 93.93 |
| VAN | 94.11 | 94.17 | 94.11 | 99.03 | 94.11 |
| Conformer-Ti | 95.00 | 95.06 | 95.00 | 99.20 | 95.00 |
| CMT | 94.82 | 95.06 | 94.83 | 99.14 | 94.81 |
| MPViT | 95.00 | 95.03 | 95.00 | 99.19 | 95.00 |
| CAW | | | | | |
Overall accuracy and other metrics of each method on the WHU-RS19 dataset.
| Model | Accuracy (%) | Precision (%) | Recall (%) | Specificity (%) | F1-score (%) |
|---|---|---|---|---|---|
| ResNet50 | 94.66 | 95.15 | 94.62 | 99.71 | 94.62 |
| Vgg16 | 94.66 | 95.44 | 94.61 | 99.71 | 94.78 |
| GoogleNet | 90.29 | 90.89 | 90.50 | 99.47 | 90.27 |
| DenseNet | 95.15 | 96.08 | 95.14 | 99.73 | 95.34 |
| ViT-Ti | 82.04 | 83.74 | 82.11 | 99.01 | 82.28 |
| Swin transformer | 91.26 | 92.25 | 91.35 | 99.52 | 91.25 |
| VAN | 93.67 | 94.44 | 93.72 | 99.66 | 93.59 |
| Conformer-Ti | 95.63 | 95.75 | 95.54 | 99.76 | 95.55 |
| CMT | 95.63 | 96.18 | 95.68 | 99.76 | 95.77 |
| MPViT | 95.63 | 95.93 | 95.75 | 99.76 | 95.69 |
| CAW | | | | | |
Figure 5. Visualization results (heat maps) of the CAW, Swin transformer, and VAN models.