S-MAT: Semantic-Driven Masked Attention Transformer for Multi-Label Aerial Scene Image Classification
Hongjun Wu, Cheng Xu, Hongzhe Liu.
Abstract
Multi-label aerial scene image classification is a long-standing and challenging research problem in the remote sensing field. Because land-cover objects usually co-exist in an aerial scene image, modeling label dependencies is a compelling way to improve performance. Previous methods generally model the label dependencies among all categories in the target dataset directly. However, most of the semantic features extracted from an image are relevant only to the objects actually present, so the dependencies among the nonexistent categories cannot be evaluated effectively. These redundant label dependencies may introduce noise and further degrade classification performance. To address this problem, we propose S-MAT, a Semantic-driven Masked Attention Transformer for multi-label aerial scene image classification. S-MAT adopts a Masked Attention Transformer (MAT) to capture the correlations among the label embeddings constructed by a Semantic Disentanglement Module (SDM). Moreover, the proposed masked attention in MAT filters out the redundant dependencies and enhances the robustness of the model. As a result, the proposed method can explicitly and accurately capture the label dependencies. Our method achieves CF1 scores of 89.21%, 90.90%, and 88.31% on three multi-label aerial scene image classification benchmark datasets: UC-Merced Multi-label, AID Multi-label, and MLRSNet, respectively. In addition, extensive ablation studies and empirical analyses are provided to demonstrate the effectiveness of the essential components of our method under different factors.
Keywords: aerial scene classification; label correlation; multi-label learning; redundancy removing; semantic disentanglement
Year: 2022 PMID: 35891109 PMCID: PMC9317133 DOI: 10.3390/s22145433
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.847
Figure 1. Illustration of the effect of masked attention.
Figure 2. The overall framework of our proposed method.
Figure 3. Illustration of the masked attention.
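The masked attention illustrated in Figure 3 can be sketched as standard scaled dot-product attention whose scores are forced to a large negative value at masked positions before the softmax, so each label embedding only aggregates information from the labels the mask keeps. The following is a minimal single-head NumPy sketch, not the paper's implementation: the function name and the toy mask (treating labels 0 and 2 as the "present" categories) are assumptions for illustration, while the actual MAT uses multi-head transformer encoder layers.

```python
import numpy as np

def masked_attention(queries, keys, values, mask):
    """Scaled dot-product attention where positions with mask == 0 are
    suppressed before the softmax, so they receive (near-)zero weight."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)          # (L, L) label-to-label scores
    scores = np.where(mask > 0, scores, -1e9)       # suppress redundant dependencies
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the kept labels
    return weights @ values

# Toy example: 4 label embeddings of dimension 8; keep only labels 0 and 2.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
mask = np.zeros((4, 4))
mask[:, [0, 2]] = 1                                 # attend only to "present" labels
out = masked_attention(x, x, x, mask)
print(out.shape)  # (4, 8)
```

Because the masked columns receive zero attention weight, changing the value vectors of the masked-out labels leaves the output unchanged, which is exactly the "redundant dependencies filtered out" behavior described in the abstract.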
Comparisons of our method with previous state-of-the-art methods on the UC-Merced Multi-label dataset. Among all the metrics, CF1 is the primary metric. Bold indicates the best performance. All metrics are in %.

| Method | CF1 | CP | CR | OP | OR |
|---|---|---|---|---|---|
| ResNet50 | 79.51 | 88.52 | 78.91 | 80.70 | 81.97 |
| ResNet-RBFNN | 80.58 | 86.21 | 83.72 | 79.92 | 84.59 |
| CA-ResNet-BiLSTM | 81.47 | 86.12 | 84.26 | 77.94 | 89.02 |
| CM-GM-N-R-BiLSTM | 81.58 | 88.57 | 85.20 | 81.60 | 89.65 |
| AL-RN-ResNet50 | 86.76 | 87.07 | 86.12 | 84.26 | |
| MLRSSC-CNN-GNN | 86.39 | 87.11 | 88.41 | - | - |
| ResNet50-SR-Net | 88.67 | 87.96 | 89.40 | 91.51 | |
| S-MAT (ours) | **89.21** | 87.97 | | | 92.94 |
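The columns in these tables follow the common multi-label convention: CP, CR, and CF1 average precision and recall per class, while OP and OR are computed over all label decisions pooled together. A minimal sketch of these metrics (assuming CF1/OF1 are taken as the harmonic mean of the averaged precision and recall; some works instead average per-class F1 scores, which gives slightly different numbers):

```python
import numpy as np

def multilabel_metrics(y_true, y_pred, eps=1e-12):
    """CP/CR/CF1 (class-averaged) and OP/OR/OF1 (overall) in %,
    for 0/1 multi-label ground truth and predictions of shape (N, C)."""
    tp = (y_true * y_pred).sum(axis=0).astype(float)   # true positives per class
    pred = y_pred.sum(axis=0)                          # predicted positives per class
    pos = y_true.sum(axis=0)                           # ground-truth positives per class
    cp = (tp / (pred + eps)).mean() * 100              # class-averaged precision
    cr = (tp / (pos + eps)).mean() * 100               # class-averaged recall
    cf1 = 2 * cp * cr / (cp + cr + eps)                # harmonic mean of CP and CR
    op = tp.sum() / (pred.sum() + eps) * 100           # overall precision
    orec = tp.sum() / (pos.sum() + eps) * 100          # overall recall
    of1 = 2 * op * orec / (op + orec + eps)
    return {"CF1": cf1, "CP": cp, "CR": cr, "OF1": of1, "OP": op, "OR": orec}

# Toy example: 3 images, 4 labels.
y_true = np.array([[1, 0, 1, 0],
                   [1, 1, 0, 0],
                   [0, 1, 1, 1]])
y_pred = np.array([[1, 0, 1, 0],
                   [1, 0, 0, 0],
                   [0, 1, 1, 0]])
m = multilabel_metrics(y_true, y_pred)
print({k: round(v, 2) for k, v in m.items()})
```

CF1 is taken as the primary metric because the class-wise averaging weights rare and frequent land-cover categories equally, whereas OP/OR are dominated by the frequent ones.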
Comparisons of our method with previous state-of-the-art methods on the AID Multi-label dataset. Among all the metrics, CF1 is the primary metric. Bold indicates the best performance. All metrics are in %.

| Method | CF1 | CP | CR | OP | OR |
|---|---|---|---|---|---|
| ResNet50 | 86.23 | 89.31 | 85.65 | 72.39 | 52.82 |
| ResNet-RBFNN | 83.77 | 82.84 | 88.32 | 60.85 | 70.45 |
| CA-ResNet-BiLSTM | 87.63 | 89.03 | 88.95 | 79.50 | 65.60 |
| AL-RN-ResNet50 | 88.72 | 91.00 | 88.95 | 80.81 | 71.12 |
| MLRSSC-CNN-GNN | 88.64 | 89.83 | 90.20 | - | - |
| ResNet50-SR-Net | 89.97 | 89.42 | 87.24 | | |
| S-MAT (ours) | **90.90** | | | 89.69 | 80.70 |
Comparisons of our method with previous state-of-the-art methods on the MLRSNet dataset. ⋆ denotes our implementation. Among all the metrics, CF1 is the primary metric. Bold indicates the best performance. All metrics are in %.

| Method | CF1 | CP | CR | OP | OR |
|---|---|---|---|---|---|
| ResNet50 | 75.30 | - | - | - | - |
| ResNet50 ⋆ | 81.35 | 80.85 | 81.56 | 82.19 | 82.70 |
| ResNet50-SR-Net | 87.21 | 87.08 | 87.34 | 88.79 | 86.73 |
| S-MAT (ResNet50, ours) | **88.31** | | | | |
| ResNet101 | 76.18 | - | - | - | - |
| ResNet101 ⋆ | 81.89 | 81.42 | 82.03 | 82.65 | 82.89 |
| ResNet101-SR-Net | 87.55 | 87.84 | 87.26 | 89.41 | 87.48 |
| S-MAT (ResNet101, ours) | | | | | |
Ablation studies on the essential components of our proposed method. Variants A–D enable different subsets of the components on top of the ResNet50 baseline; the full model uses all of them. Bold indicates the best performance.

| Configuration | CF1 (UC-Merced) | CF1 (AID) | CF1 (MLRSNet) |
|---|---|---|---|
| Baseline | 79.51 | 86.23 | 81.35 |
| Variant A | 82.35 | 87.16 | 83.46 |
| Variant B | 84.18 | 87.78 | 84.71 |
| Variant C | 85.61 | 88.26 | 85.93 |
| Variant D | 87.59 | 90.14 | 86.48 |
| Full model (S-MAT) | **89.21** | **90.90** | **88.31** |
Figure 4. The results of the ablation studies on the number of attention heads in a layer in MAT.
Figure 5. The results of the ablation studies on the number of encoder layers in MAT.
Ablation studies on the position to apply masked attention. The symbol ✓ indicates that masked attention is used in that encoder layer. Bold indicates the best performance.

| Layer 1 | Layer 2 | Layer 3 | CF1 (UC-Merced) | CF1 (AID) | CF1 (MLRSNet) |
|---|---|---|---|---|---|
| - | - | - | 87.83 | 89.95 | 86.39 |
| ✓ | - | - | 88.57 | 90.26 | 87.42 |
| - | ✓ | - | 88.12 | 90.08 | 86.81 |
| - | - | ✓ | 88.85 | 90.43 | 87.70 |
| ✓ | ✓ | ✓ | **89.21** | **90.90** | **88.31** |
Figure 6. The results of the ablation studies on the selection of k in the generation of the mask in MAT.
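Figure 6 ablates the choice of k when generating the attention mask. One plausible construction, sketched below, keeps the k categories with the highest class-confidence scores and masks out the rest; the function name and the scores here are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def topk_mask(class_scores, k):
    """Binary attention mask that keeps only the k labels with the
    highest confidence scores; the remaining (likely absent) labels
    are masked out, so their dependencies are ignored."""
    keep = np.argsort(class_scores)[-k:]          # indices of the top-k labels
    n = len(class_scores)
    mask = np.zeros((n, n))
    mask[:, keep] = 1.0                           # every query may attend to kept labels
    return mask

scores = np.array([0.9, 0.1, 0.7, 0.2, 0.05])     # hypothetical per-class confidences
mask = topk_mask(scores, k=2)
print(mask)
```

A small k filters dependencies aggressively but risks masking a truly present label whose score is low, while a large k keeps more noise; this trade-off is what the ablation over k measures.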
Ablation studies on position embedding. The symbol ✓ indicates that position embedding is in use. Bold indicates the best performance.

| Position Embedding | CF1 (UC-Merced) | CF1 (AID) | CF1 (MLRSNet) |
|---|---|---|---|
| - | 89.19 | 90.89 | 88.22 |
| ✓ | **89.21** | **90.90** | **88.31** |
Figure 7. Visualization of the class-specific activation map in the SDM on the AID Multi-label dataset.
Figure 8. Visualization of the relation matrix in MAT on the AID Multi-label dataset.
Figure 9. Qualitative results on the AID Multi-label dataset.
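Figure 7's class-specific activation maps suggest a CAM-style view of the SDM: per-class weights project the backbone feature map into one spatial activation map per category, which can then be pooled into one embedding per label. The sketch below is a hypothetical NumPy illustration of that idea; the shapes, the softmax spatial pooling, and the 17-class setting (UC-Merced Multi-label uses 17 labels) are assumptions, not the paper's exact SDM.

```python
import numpy as np

def class_activation_maps(features, classifier_weights):
    """Project an (H, W, D) feature map onto per-class activation maps
    (CAM-style), then pool each map with its own softmax-normalized
    spatial attention to produce one D-dim embedding per label."""
    h, w, d = features.shape
    flat = features.reshape(h * w, d)                # (HW, D) spatial descriptors
    maps = flat @ classifier_weights.T               # (HW, C) class activations
    attn = np.exp(maps - maps.max(axis=0))
    attn /= attn.sum(axis=0)                         # softmax over spatial positions
    label_embeddings = attn.T @ flat                 # (C, D) one embedding per class
    return maps.reshape(h, w, -1), label_embeddings

rng = np.random.default_rng(1)
feats = rng.normal(size=(7, 7, 32))                  # toy backbone feature map
w_cls = rng.normal(size=(17, 32))                    # 17 classes, e.g., UC-Merced
cams, emb = class_activation_maps(feats, w_cls)
print(cams.shape, emb.shape)  # (7, 7, 17) (17, 32)
```

Disentangling the features per class in this way is what lets the downstream transformer attend over label embeddings rather than a single global image vector.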