Gan Cai¹, Yu Zhu¹, Yue Wu¹, Xiaoben Jiang¹, Jiongyao Ye¹, Dawei Yang²,³.
Abstract
The prevalence of skin diseases is rising, and their diagnosis remains a challenging task in the clinic. Deep learning could help to meet these challenges. In this study, a novel neural network is proposed for the classification of skin diseases. Since the datasets used in this research consist of skin disease images and clinical metadata, we propose a novel multimodal Transformer, which consists of two encoders, one for images and one for metadata, and a decoder that fuses the multimodal information. In the proposed network, a suitable Vision Transformer (ViT) model is utilized as the backbone to extract deep image features. The metadata are regarded as labels, and a new Soft Label Encoder (SLE) is designed to embed them. Furthermore, in the decoder, a novel Mutual Attention (MA) block is proposed to better fuse the image features and metadata features. To evaluate the model's effectiveness, extensive experiments were conducted on a private skin disease dataset and the benchmark ISIC 2018 dataset. Compared with state-of-the-art methods, the proposed model shows better performance and represents an advancement in skin disease diagnosis.
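The pipeline the abstract describes (an image encoder, a metadata encoder, and a decoder that fuses the two streams before classification) can be sketched roughly as follows. This is a minimal numpy illustration of the data flow only; all dimensions, module stand-ins, and the single cross-attention fusion step are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# --- Stand-ins for the two encoders --------------------------------------
# Image encoder (ViT backbone): produces a sequence of patch tokens.
num_patches, d = 16, 32
image_tokens = rng.standard_normal((num_patches, d))

# Metadata encoder (SLE): embeds clinical metadata (e.g. site, sex, age
# bin) into the same feature space as the image tokens.
num_meta = 3
meta_tokens = rng.standard_normal((num_meta, d))

# --- Decoder: one cross-attention step fusing the two streams ------------
Wq = rng.standard_normal((d, d)) / np.sqrt(d)
Wk = rng.standard_normal((d, d)) / np.sqrt(d)
Wv = rng.standard_normal((d, d)) / np.sqrt(d)

Q = meta_tokens @ Wq    # metadata queries attend over ...
K = image_tokens @ Wk   # ... image keys
V = image_tokens @ Wv   # ... and image values
fused = softmax(Q @ K.T / np.sqrt(d)) @ V   # (num_meta, d) fused features

# Pool fused features and classify (7 classes, as in ISIC 2018).
logits = fused.mean(axis=0) @ rng.standard_normal((d, 7))
print(logits.shape)  # (7,)
```

The key structural point is that queries and keys/values come from different modalities, so the decoder mixes metadata context into the image representation rather than simply concatenating the two feature vectors.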
Keywords: Attention; Deep learning; Multimodal fusion; Skin disease; Transformer
Year: 2022 PMID: 35540957 PMCID: PMC9070977 DOI: 10.1007/s00371-022-02492-4
Source DB: PubMed Journal: Vis Comput ISSN: 0178-2789 Impact factor: 2.835
Fig. 1 The overall architecture of the model
Comparison of the performance of networks on the private dataset
| Methods | Acc | Sen | Spe | F1 | AUC |
|---|---|---|---|---|---|
| DenseNet121 | 0.667 | 0.722 | 0.943 | 0.664 | 0.889 |
| ResNet101 | 0.662 | 0.726 | 0.950 | 0.673 | 0.889 |
| ViT-B | 0.594 | 0.623 | 0.935 | 0.587 | 0.850 |
| Swin-B | 0.693 | 0.736 | 0.966 | 0.695 | 0.918 |
| NesT-B |
Bold values indicate the best result for each metric, making it easier to compare methods and highlighting the effectiveness of the proposed approach.
Fig. 2 Forward propagation of one-hot encoded vectors in the MLP
Fig. 3 Comparison of the Soft Label Encoder with a one-hot encoder
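Figures 2 and 3 contrast one-hot metadata encoding with the Soft Label Encoder. A minimal numeric illustration of the difference follows; the uniform smoothing scheme and `eps` value are assumptions for illustration, not necessarily the paper's exact SLE formula.

```python
import numpy as np

def one_hot(idx, n):
    """Hard target: all probability mass on one class."""
    v = np.zeros(n)
    v[idx] = 1.0
    return v

def soft_label(idx, n, eps=0.1):
    """Soft target: mass eps spread uniformly over the other classes."""
    v = np.full(n, eps / (n - 1))
    v[idx] = 1.0 - eps
    return v

n = 5
print(one_hot(2, n))     # [0.    0.    1.    0.    0.   ]
print(soft_label(2, n))  # [0.025 0.025 0.9   0.025 0.025]
```

The soft encoding keeps every component non-zero, so downstream linear layers receive gradient signal from all dimensions of the metadata embedding rather than from a single active unit.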
Fig. 4 Mutual Attention block
Fig. 5 Multi-head Cross Attention
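The multi-head cross attention of Fig. 5 can be written compactly as below. This numpy sketch splits the feature dimension across heads by slicing (omitting per-head projection matrices for brevity); head count and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_cross_attention(q_seq, kv_seq, num_heads):
    """Queries come from one modality, keys/values from the other."""
    d = q_seq.shape[-1]
    dh = d // num_heads  # feature slice per head
    out = []
    for h in range(num_heads):
        sl = slice(h * dh, (h + 1) * dh)
        Q, K, V = q_seq[:, sl], kv_seq[:, sl], kv_seq[:, sl]
        attn = softmax(Q @ K.T / np.sqrt(dh))  # scaled dot-product
        out.append(attn @ V)
    return np.concatenate(out, axis=-1)  # same shape as q_seq

meta = rng.standard_normal((3, 32))   # metadata tokens as queries
img = rng.standard_normal((16, 32))   # image patch tokens as keys/values
fused = multi_head_cross_attention(meta, img, num_heads=4)
print(fused.shape)  # (3, 32)
```

Our reading of the figure titles is that the Mutual Attention block of Fig. 4 runs such cross attention in both directions (image attending to metadata and vice versa); that bidirectionality is an interpretation, not confirmed by the excerpt.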
Fig. 6 The convergence graphs on the private dataset (a) and public dataset (b)
Comparison of different metadata encoders (private dataset)
| Methods | Acc1 | Acc2 | Spe | Sen | F1 | AUC |
|---|---|---|---|---|---|---|
| Word2vec | 0.313 | 0.750 | 0.967 | 0.746 | 0.718 | 0.944 |
| One-hot Encoder | 0.401 | 0.763 | 0.962 | 0.783 | 0.745 | 0.947 |
| Soft Label Encoder |
Bold values indicate the best result for each metric, making it easier to compare methods and highlighting the effectiveness of the proposed approach.
Comparison of different fusion methods (private dataset)
| Methods | Acc | Spe | Sen | F1 | AUC |
|---|---|---|---|---|---|
| No metadata | 0.750 | 0.946 | 0.746 | 0.716 | 0.944 |
| Element-wise concat | 0.777 | 0.9732 | 0.804 | 0.788 | 0.964 |
| Element-wise multiply | 0.762 | 0.9718 | 0.801 | 0.756 | 0.971 |
| MFB [ | 0.756 | 0.9705 | 0.794 | 0.747 | 0.954 |
| BAN [ | 0.746 | 0.9709 | 0.795 | 0.750 | 0.967 |
| CrossViT [ | 0.750 | 0.9677 | 0.783 | 0.746 | 0.954 |
| MetaBlock [ | 0.786 | 0.9728 | 0.823 | 0.807 | 0.968 |
| Mutual attention |
Bold values indicate the best result for each metric, making it easier to compare methods and highlighting the effectiveness of the proposed approach.
Fig. 7 The results of the proposed model on the private dataset: (a) ROC curve, (b) confusion matrix
Comparison on ISIC 2018
| Methods | Acc | Sen | Spe | F1 | AUC |
|---|---|---|---|---|---|
| Multi-model [ | 0.8980 | 0.8971 | / | 0.8992 | 0.978 |
| MobileNet [ | 0.9270 | 0.7242 | 0.9714 | 0.7277 | 0.96 |
| DenseNet [ | 0.8580 | 0.6904 | 0.9592 | / | 0.88 |
| Semi-supervised [ | 0.9254 | 0.7147 | 0.9272 | 0.6068 | 0.9358 |
| Transfer-learning [ | 0.914 | 0.8374 | / | / | 0.974 |
| MAT [ | 0.9255 | / | / | / | 0.98 |
| Ours |
Bold values indicate the best result for each metric, making it easier to compare methods and highlighting the effectiveness of the proposed approach.
Fig. 8 The results of the proposed model on ISIC 2018: (a) ROC curve, (b) confusion matrix
Ablation experiments (ISIC 2018)
| SLE | MA | Acc | Sen | Spe | AUC |
|---|---|---|---|---|---|
|  |  | 0.9206 | 0.8825 | 0.9795 | 0.9896 |
| ✓ |  | 0.9281 | 0.8967 | 0.9812 | 0.9920 |
|  | ✓ | 0.9286 | 0.8893 | 0.9824 | 0.9918 |
| ✓ | ✓ | 0.9381 | 0.9014 | 0.9836 | 0.9932 |