Saul Calderon-Ramirez1,2, Diego Murillo-Hernandez3,4, Kevin Rojas-Salazar3,4, David Elizondo5, Shengxiang Yang5, Armaghan Moemeni6, Miguel Molina-Cabello7,8.
Abstract
The implementation of deep learning-based computer-aided diagnosis systems for the classification of mammogram images can help improve the accuracy and reliability of patient diagnosis while reducing its cost. However, training a deep learning model requires a considerable amount of labelled images, which can be expensive to obtain, as they demand time and effort from clinical practitioners. To address this, a number of publicly available datasets have been built with data from different hospitals and clinics, which can be used to pre-train a model. However, using models trained on these datasets for later transfer learning and model fine-tuning with images sampled from a different hospital or clinic might result in lower performance. This is due to the distribution mismatch between the datasets, which include different patient populations and image acquisition protocols. In this work, a real-world scenario is evaluated where a novel target dataset sampled from a private Costa Rican clinic is used, with few labels and heavily imbalanced data. The use of two popular and publicly available datasets (INbreast and CBIS-DDSM) as source data, to train and test the models on the novel target dataset, is evaluated. A common approach to further improve model performance in such a small labelled target dataset setting is data augmentation. However, unlabelled data from the target clinic is often available at a much lower cost. Therefore, semi-supervised deep learning, which leverages both labelled and unlabelled data, can be used in such conditions. In this work, we evaluate the semi-supervised deep learning approach known as MixMatch, to take advantage of unlabelled data from the target dataset, for whole mammogram image classification.
We compare the use of semi-supervised learning on its own, and combined with transfer learning (from a source mammogram dataset) and data augmentation, against regular supervised learning with transfer learning and data augmentation from source datasets. It is shown that the use of semi-supervised deep learning combined with transfer learning and data augmentation can provide a meaningful advantage when labelled observations are scarce. We also found a strong influence of the source dataset, which suggests that a more data-centric approach is needed to tackle the challenge of scarcely labelled data. We used several different metrics (such as the G-mean and the F2-score) to assess the performance gain of using semi-supervised learning when dealing with very imbalanced test datasets, as mammogram datasets often are. Graphical Abstract: Description of the test-bed implemented in this work. Two different source data distributions were used to fine-tune the different models tested in this work. The target dataset is the in-house CR-Chavarria-2020 dataset.
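For orientation, MixMatch builds pseudo-labels for unlabelled images by averaging the model's predictions over several augmentations, sharpening the averaged distribution with a temperature, and then mixing labelled and unlabelled examples via MixUp. The following is a minimal pure-Python sketch of those three operations as described in the original MixMatch paper; the function names, the temperature T = 0.5, and alpha = 0.75 are common defaults used for illustration, not the exact configuration of this work.

```python
import random

def sharpen(probs, T=0.5):
    """Temperature sharpening: raise each class probability to 1/T and
    renormalise, pushing the guessed label distribution towards one-hot."""
    powered = [p ** (1.0 / T) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

def guess_label(predictions):
    """Average the model's predictions over K augmentations of the same
    unlabelled image, then sharpen the result."""
    k = len(predictions)
    n_classes = len(predictions[0])
    avg = [sum(pred[c] for pred in predictions) / k for c in range(n_classes)]
    return sharpen(avg)

def mixup(x1, y1, x2, y2, alpha=0.75):
    """MixUp: convex combination of two samples and their labels.
    Taking lam >= 0.5 keeps the mix closer to its first argument."""
    lam = random.betavariate(alpha, alpha)
    lam = max(lam, 1.0 - lam)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y

# Predictions for two augmentations of one unlabelled mammogram
# (hypothetical benign-vs-malignant probabilities):
guessed = guess_label([[0.7, 0.3], [0.5, 0.5]])
# guessed[0] > guessed[1]: sharpening strengthens the dominant class
```

In the full algorithm, the mixed labelled batch feeds a supervised cross-entropy term, while the mixed unlabelled batch with its guessed labels feeds a consistency term weighted by an unsupervised loss coefficient.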
Keywords: Breast cancer; Data imbalance; Mammogram; Semi-supervised deep learning; Transfer learning
Year: 2022 PMID: 35239108 PMCID: PMC8892413 DOI: 10.1007/s11517-021-02497-6
Source DB: PubMed Journal: Med Biol Eng Comput ISSN: 0140-0118 Impact factor: 3.079
Fig. 1 Diagram of experimental configurations presented in this work
Summary of datasets used in this work
| | INbreast | CBIS-DDSM | Target CR dataset |
|---|---|---|---|
| Origin | Portugal | USA | Costa Rica |
| Year | 2011 | 1997–2016 | 2020 |
| Number of cases | 115 | 1566 | 87 |
| Number of images | 410 | 3103 | 341 |
| Views | CC, MLO | CC, MLO | CC, MLO |
| Image mode | Full-field digital | Digitized screen-film | Full-field digital |
| Categories | BI-RADS, ACR Density | BI-RADS, ACR Density, Verified Pathology | BI-RADS |
| ROI annotations | Yes | Yes | No |
Fig. 9 Examples of benign (top) and malignant (bottom) mammogram images from each dataset
Fig. 2 BI-RADS categories distribution for the target CR dataset
Fig. 3 Binary categories distribution for the target CR dataset
Fig. 4 Craniocaudal (CC) and mediolateral oblique (MLO) views distribution for the complete and binary-labelled target datasets
Fig. 5 Depicted breast distribution for the complete and binary-labelled target datasets
Fig. 6 Age distribution for patients in the target CR dataset
Fig. 7 Age distribution according to BI-RADS categories for patients in the target CR dataset
Fig. 8 Age distribution according to binary categories for patients in the target CR dataset
Fig. 10 Examples of images from the original CR data discarded due to image quality (top) or patients with breast implants (bottom)
Fig. 11 Examples of images with background noise from the CBIS-DDSM dataset, before and after preprocessing
Classification performance for models of configuration S+No-FT, using the VGG-19 architecture
| Metric | INbreast models (mean) | INbreast models (SD) | CBIS-DDSM models (mean) | CBIS-DDSM models (SD) |
|---|---|---|---|---|
| G-Mean | 0.3773 | 0.1043 | 0.3476 | 0.2534 |
| F2-Score | 0.1882 | 0.0625 | 0.1347 | 0.1148 |
| Accuracy | 0.2183 | 0.0602 | 0.7379 | 0.0678 |
| Recall | 0.7667 | 0.2509 | 0.2333 | 0.1876 |
| Specificity | 0.1901 | 0.0558 | 0.7639 | 0.0707 |
| Precision | 0.0470 | 0.0160 | 0.0517 | 0.0467 |
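The G-mean and F2-score reported in these tables are better suited than plain accuracy to the heavily imbalanced target data: the G-mean is the geometric mean of recall and specificity, and the F2-score is the F-beta score with beta = 2, weighting recall more heavily than precision. As a reference, a minimal sketch of computing both from binary confusion-matrix counts (function names are illustrative, not from the paper's code):

```python
import math

def gmean(tp, fp, fn, tn):
    """Geometric mean of recall (sensitivity) and specificity; a
    majority-class predictor scores 0 regardless of class imbalance."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(recall * specificity)

def fbeta(tp, fp, fn, beta=2.0):
    """F-beta score; beta = 2 (the F2-score) weights recall twice as
    heavily as precision, penalising missed malignant cases."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For example, with hypothetical counts tp=8, fp=20, fn=2, tn=70, recall is 0.8 but precision only 8/28, so the F2-score stays well above the F0.5-score would, reflecting its emphasis on not missing positives.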
Classification performance for models of configuration SSDL, using the VGG-19 architecture
| Metric | n=20 labels (mean) | n=20 (SD) | n=40 labels (mean) | n=40 (SD) | n=60 labels (mean) | n=60 (SD) |
|---|---|---|---|---|---|---|
| G-Mean | 0.4798 | 0.1936 | 0.5720 | 0.1257 | 0.6413 | 0.0929 |
| F2-Score | 0.2169 | 0.1194 | 0.2683 | 0.1168 | 0.3038 | 0.0889 |
| Accuracy | 0.5786 | 0.2212 | 0.6482 | 0.2172 | 0.6869 | 0.1412 |
| Recall | 0.5167 | 0.2687 | 0.5750 | 0.2648 | 0.6333 | 0.2297 |
| Specificity | 0.5815 | 0.2404 | 0.6518 | 0.2354 | 0.6904 | 0.1544 |
| Precision | 0.1189 | 0.1551 | 0.1079 | 0.0754 | 0.1096 | 0.0491 |
Summary of G-Mean scores for models of configurations SSDL+FT and S+FT, using 20 labelled observations. The corresponding number of trainable parameters for the PyTorch implementation of each architecture is also shown. Bold entries refer to the best result between supervised and SSDL models

| Architecture | INbreast: SSDL | | Supervised | CBIS-DDSM: SSDL | | Supervised | Trainable parameters |
|---|---|---|---|---|---|---|---|
| VGG19_bn | 0.1084 | 0.6682 | 0.0770 | 0.0742 | 0.5163 | 0.2826 | 139.5 million |
| ResNet-152 | 0.1167 | 0.6767 | 0.1021 | 0.1075 | 0.5857 | 0.0598 | 58.1 million |
| EfficientNet-b0 | 0.1081 | 0.6393 | 0.0603 | 0.0753 | 0.5824 | 0.0489 | 4 million |
Results of configurations SSDL+FT and S+FT, using INbreast as source dataset with the VGG-19 architecture. Bold entries refer to the best result between supervised and SSDL models
| Labels | Metric | SSDL | | Supervised |
|---|---|---|---|---|
| 20 | G-Mean | 0.1084 | 0.6682 | 0.0770 |
| | F2-Score | 0.0973 | 0.3133 | 0.0673 |
| | Accuracy | 0.0727 | 0.7014 | 0.0793 |
| | Recall | 0.5917 | 0.1687 | 0.1748 |
| | Specificity* | 0.0755 | 0.7048 | 0.0876 |
| | Precision | 0.0636 | 0.1074 | 0.0335 |
| 40 | G-Mean | 0.0932 | 0.6656 | 0.0877 |
| | F2-Score | 0.0899 | 0.3484 | 0.1112 |
| | Accuracy | 0.0659 | 0.7224 | 0.1590 |
| | Recall | 0.1715 | 0.6417 | 0.2081 |
| | Specificity | 0.0693 | 0.7262 | 0.1721 |
| | Precision | 0.1380 | 0.0373 | 0.1708 |
| 60 | G-Mean | 0.0957 | 0.6604 | 0.0876 |
| | F2-Score | 0.3278 | 0.0958 | 0.1116 |
| | Accuracy | 0.7211 | 0.1169 | 0.1374 |
| | Recall | 0.1318 | 0.6000 | 0.1748 |
| | Specificity | 0.7267 | 0.1230 | 0.1466 |
| | Precision | 0.1226 | 0.0565 | 0.1704 |
* Statistical significance (p-values < 0.05) for average differences between results of SSDL and supervised models
Results of configurations SSDL+FT and S+FT, using CBIS-DDSM as source dataset with the VGG-19 architecture. Bold entries refer to the best result between supervised and SSDL models
| Labels | Metric | SSDL | | Supervised |
|---|---|---|---|---|
| 20 | G-Mean* | 0.0742 | 0.5163 | 0.2826 |
| | F2-Score | 0.0909 | 0.2892 | 0.1797 |
| | Accuracy* | 0.7455 | 0.1115 | 0.0710 |
| | Recall* | 0.1459 | 0.3917 | 0.2292 |
| | Specificity* | 0.7460 | 0.1201 | 0.0709 |
| | Precision | 0.1480 | 0.0551 | 0.1289 |
| 40 | G-Mean* | 0.0909 | 0.5743 | 0.2308 |
| | F2-Score* | 0.1124 | 0.3070 | 0.1597 |
| | Accuracy* | 0.7588 | 0.1041 | 0.0476 |
| | Recall* | 0.1632 | 0.4417 | 0.2189 |
| | Specificity* | 0.7612 | 0.1110 | 0.0453 |
| | Precision | 0.0630 | 0.1458 | 0.0899 |
| 60 | G-Mean | 0.0717 | 0.6466 | 0.1462 |
| | F2-Score | 0.1001 | 0.3436 | 0.1506 |
| | Accuracy* | 0.7197 | 0.1445 | 0.0723 |
| | Recall | 0.1459 | 0.5333 | 0.2297 |
| | Specificity* | 0.7190 | 0.1559 | 0.0779 |
| | Precision | 0.1435 | 0.0623 | 0.0834 |
* Statistical significance (p-values < 0.05) for average differences between results of SSDL and supervised models
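As the tables above suggest, accuracy alone can look deceptively healthy on imbalanced mammogram test sets. A small self-contained illustration (synthetic counts, chosen only for demonstration) shows that a trivial classifier labelling every image as benign still achieves high accuracy, while the G-mean collapses to zero:

```python
import math

# Hypothetical imbalanced test set: 300 benign images, 20 malignant.
# A trivial classifier predicts the majority (benign) class for everything.
tp, fn = 0, 20    # every malignant case is missed
tn, fp = 300, 0   # every benign case is trivially "correct"

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
g_mean = math.sqrt(recall * specificity)

# High accuracy (0.9375) despite zero diagnostic value; G-mean is 0.
print(f"accuracy={accuracy:.4f}, g_mean={g_mean:.4f}")
```

This is why the G-mean and F2-score, rather than accuracy, are the headline metrics when comparing the SSDL and supervised configurations in this work.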