| Literature DB >> 35885455 |
Xuxin Chen, Ke Zhang, Neman Abdoli, Patrik W Gilley, Ximin Wang, Hong Liu, Bin Zheng, Yuchen Qiu.
Abstract
Deep convolutional neural networks (CNNs) have been widely used in various medical imaging tasks. However, due to the intrinsic locality of convolution operations, CNNs generally cannot model long-range dependencies well, which are important for accurately identifying or mapping corresponding breast lesion features computed from unregistered multiple mammograms. This motivated us to leverage the architecture of Multi-view Vision Transformers to capture long-range relationships among the multiple mammograms acquired from the same patient in one examination. For this purpose, we employed local transformer blocks to separately learn patch relationships within each of the four mammograms acquired from two views (CC/MLO) of the two sides (right/left breasts). The outputs from the different views and sides were concatenated and fed into global transformer blocks to jointly learn patch relationships between the four images representing the two views of the left and right breasts. To evaluate the proposed model, we retrospectively assembled a dataset of 949 sets of mammograms, comprising 470 malignant cases and 479 normal or benign cases. We trained and evaluated the model using five-fold cross-validation. Without any arduous preprocessing steps (e.g., optimal window cropping, chest wall or pectoral muscle removal, two-view image registration), our four-image (two-view-two-side) transformer-based model achieves an area under the ROC curve of AUC = 0.818 ± 0.039 for case classification, significantly outperforming the AUC = 0.784 ± 0.016 achieved by the state-of-the-art multi-view CNNs (p = 0.009). It also outperforms two one-view-two-side models that achieve AUCs of 0.724 ± 0.013 (CC view) and 0.769 ± 0.036 (MLO view), respectively. The study demonstrates the potential of using transformers to develop high-performing computer-aided diagnosis schemes that combine four mammograms.
Keywords: breast cancer; classification; computer-aided diagnosis; deep learning; mammogram; multi-view; self-attention; transformer
Year: 2022 PMID: 35885455 PMCID: PMC9320758 DOI: 10.3390/diagnostics12071549
Source DB: PubMed Journal: Diagnostics (Basel) ISSN: 2075-4418
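The evaluation protocol described in the abstract is a five-fold cross-validation over 949 case-level exams (470 malignant, 479 normal/benign). A minimal sketch of such a case-level stratified split, assuming scikit-learn; the placeholder labels, seed, and variable names are illustrative, not details from the paper:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Placeholder case-level labels matching the reported cohort:
# 470 malignant (1) and 479 normal/benign (0) cases.
labels = np.array([1] * 470 + [0] * 479)
cases = np.arange(len(labels))

# A stratified 5-fold split keeps the malignant/benign ratio in every fold;
# all four mammograms (LCC, RCC, LMLO, RMLO) of a case stay in one fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(cases, labels), 1):
    print(f"Fold {fold}: {len(train_idx)} train / {len(test_idx)} test cases")
```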
Figure 1. Overview of using transformers for multi-view mammogram analysis. For a patient, each of the LCC, RCC, LMLO, and RMLO mammograms is split into image patches that are mapped to embedding vectors in the latent space. Positional embeddings are then added to the patch embeddings. Each sequence of embedding vectors is sent into local transformer blocks to learn within-mammogram dependencies; the weights of the local transformer blocks are shared among the four mammograms. The four outputs are concatenated into one sequence and fed into global transformer blocks to learn inter-mammogram dependencies. The class token of the last global transformer block is sent into an MLP head to classify the case as benign/malignant.
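A minimal PyTorch sketch of the pipeline in Figure 1, with `nn.TransformerEncoderLayer` standing in for the paper's DeiT blocks; the class name, dimensions, and depths are illustrative assumptions (dim = 192 with 3 heads matches DeiT-tiny's configuration):

```python
import torch
import torch.nn as nn

class MultiViewTransformer(nn.Module):
    """Sketch: shared local blocks per mammogram, then global blocks over all four."""

    def __init__(self, dim=192, n_heads=3, n_local=2, n_global=10, n_classes=2):
        super().__init__()
        make_block = lambda: nn.TransformerEncoderLayer(
            d_model=dim, nhead=n_heads, batch_first=True, norm_first=True)
        self.local_blocks = nn.ModuleList(make_block() for _ in range(n_local))
        self.global_blocks = nn.ModuleList(make_block() for _ in range(n_global))
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, n_classes)  # stand-in for the MLP head

    def forward(self, views):
        # views: 4 patch-embedding sequences (LCC, RCC, LMLO, RMLO), each of
        # shape (batch, n_patches, dim), with positional embeddings already added.
        outs = []
        for x in views:
            for blk in self.local_blocks:       # weights shared across the 4 views
                x = blk(x)
            outs.append(x)
        x = torch.cat(outs, dim=1)              # concatenate into one sequence
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)          # prepend a class token
        for blk in self.global_blocks:          # joint inter-mammogram attention
            x = blk(x)
        return self.head(x[:, 0])               # classify from the class token

# Example: four views for a batch of 2 cases, 196 patches each.
model = MultiViewTransformer()
views = [torch.randn(2, 196, 192) for _ in range(4)]
print(model(views).shape)  # torch.Size([2, 2])
```

The default `n_local=2, n_global=10` mirrors one of the configurations ablated in the block-count table below.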
Figure 2. (a) The transformer block consists of an MSA, an MLP, skip connections, and layer normalization. (b) Self-attention (scaled dot-product attention). Matmul: multiplication of two matrices. (c) Multi-head attention consists of multiple parallel self-attention heads. Concat: concatenation of feature representations. h: the number of self-attention heads.
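A compact sketch of the scaled dot-product attention in panel (b), written directly from its standard definition softmax(QK^T / sqrt(d_k))V; the function name and shapes are illustrative:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # Matmul, then scale
    weights = scores.softmax(dim=-1)                          # attention map
    return weights @ v                                        # Matmul with values

# Multi-head attention (panel (c)) runs h such heads in parallel, concatenates
# their outputs along the feature axis, and applies a final linear projection.
q = k = v = torch.randn(1, 3, 16, 64)  # e.g., h = 3 heads, d_k = 64
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([1, 3, 16, 64])
```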
Summary of model-generated case classification accuracy (ACC, %) along with the standard deviation (STD) in five-fold cross-validation on the unregistered, four-view mammogram dataset (LCC, RCC, LMLO, RMLO), with different numbers of local and global transformer blocks (using the DeiT-small model).
| Local blocks | 0 | 2 | 4 | 8 | 12 |
|---|---|---|---|---|---|
| Global blocks | 12 | 10 | 8 | 4 | 0 |
| Fold 1 | 79.5 | 78.9 | 78.9 | 73.7 | 73.2 |
| Fold 2 | 74.7 | 77.4 | 77.4 | 73.7 | 74.7 |
| Fold 3 | 74.7 | 76.3 | 76.3 | 72.6 | 68.4 |
| Fold 4 | 74.7 | 75.8 | 75.3 | 74.2 | 71.1 |
| Fold 5 | 73.5 | 76.7 | 73.5 | 73.2 | 74.6 |
| Mean ACC (%) ± STD | 75.4 ± 2.3 | 77.0 ± 1.2 | 76.3 ± 2.0 | 73.5 ± 0.6 | 72.4 ± 2.7 |
Figure 3. ROC curves of the six models: single-view transformers, multi-view transformers, and CNNs.
Figure 4. Confusion matrix of the two-view DeiT-tiny model's classification results.
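The ROC curves and confusion matrix of Figures 3 and 4 follow from standard case-level scores and labels; a hedged scikit-learn sketch, where the score and label arrays are random placeholders rather than the paper's predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve

# Placeholder per-case malignancy scores and labels for one test fold
# (~949 / 5 ≈ 190 cases); real scores would come from the model's output.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=190)
y_score = np.clip(0.3 * y_true + rng.normal(0.5, 0.25, size=190), 0.0, 1.0)

fpr, tpr, _ = roc_curve(y_true, y_score)       # points of one ROC curve (Figure 3)
auc = roc_auc_score(y_true, y_score)           # area under that curve
cm = confusion_matrix(y_true, (y_score >= 0.5).astype(int))  # Figure 4 style
print(f"AUC = {auc:.3f}\nconfusion matrix:\n{cm}")
```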
Result comparisons of the proposed model and other models. Params denotes the number of parameters. For the p-values of the different two-view models, the model proposed in paper [16] is used as the reference base.
| Model | Params | CC | MLO | Accuracy (%) | Precision | Recall | Specificity | F1 Score | AUC | p-Value |
|---|---|---|---|---|---|---|---|---|---|---|
| Single-view | 5.5 M | ✓ | | 69.6 ± 1.4 | 0.708 ± 0.035 | 0.662 ± 0.040 | 0.728 ± 0.058 | 0.683 ± 0.013 | 0.724 ± 0.013 | <0.001 |
| Single-view | 5.5 M | | ✓ | 72.5 ± 2.8 | 0.728 ± 0.037 | 0.713 ± 0.027 | 0.737 ± 0.045 | 0.720 ± 0.028 | 0.769 ± 0.036 | 0.376 |
| Two-view DeiT-tiny | 5.5 M | ✓ | ✓ | | | | | | | 0.031 |
| Two-view DeiT-small | 21.7 M | ✓ | ✓ | 76.3 ± 2.8 | 0.706 ± 0.033 | 0.747 ± 0.018 | | | 0.818 ± 0.039 | 0.009 |
| CNN feature fusion [16] | 44.7 M | ✓ | ✓ | 73.9 ± 2.4 | 0.761 ± 0.05 | 0.696 ± 0.041 | 0.781 ± 0.071 | 0.725 ± 0.019 | 0.784 ± 0.016 | - |
| View-wise | 22.4 M | ✓ | ✓ | 71.7 ± 2.1 | 0.735 ± 0.04 | 0.677 ± 0.051 | 0.756 ± 0.069 | 0.702 ± 0.021 | 0.759 ± 0.023 | N/A |
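For reference, the table's threshold-based metrics follow directly from confusion-matrix counts; a small sketch of the standard definitions (the counts below are invented for illustration, not taken from the study):

```python
def classification_metrics(tp, fp, tn, fn):
    """The threshold-based metrics reported in the table above."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                              # a.k.a. sensitivity
    specificity = tn / (tn + fp)
    accuracy = 100.0 * (tp + tn) / (tp + fp + tn + fn)   # in %, as in the table
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}

# Invented counts at roughly the scale of one test fold (~190 cases).
print(classification_metrics(tp=71, fp=24, tn=74, fn=21))
```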