
Diagnostic Accuracy of Machine Learning Models on Mammography in Breast Cancer Classification: A Meta-Analysis.

Tengku Muhammad Hanis1, Md Asiful Islam2,3, Kamarul Imran Musa1.   

Abstract

In this meta-analysis, we aimed to estimate the diagnostic accuracy of machine learning models on digital mammograms and tomosynthesis in breast cancer classification and to assess the factors affecting their diagnostic accuracy. We searched for related studies in Web of Science, Scopus, PubMed, Google Scholar and Embase. The studies were screened in two stages to exclude unrelated studies and duplicates. Finally, 36 studies containing 68 machine learning models were included in this meta-analysis. The area under the curve (AUC), hierarchical summary receiver operating characteristics (HSROC) curve, pooled sensitivity and pooled specificity were estimated using a bivariate Reitsma model. The overall AUC, pooled sensitivity and pooled specificity were 0.90 (95% CI: 0.85-0.90), 0.83 (95% CI: 0.78-0.87) and 0.84 (95% CI: 0.81-0.87), respectively. The three significant covariates identified in this study were country (p = 0.003), source (p = 0.002) and classifier (p = 0.016); the type of data covariate was not statistically significant (p = 0.121). Additionally, Deeks' linear regression test indicated publication bias in the included studies (p = 0.002); thus, the results should be interpreted with caution.


Keywords:  breast cancer; diagnostic accuracy; machine learning; mammography; meta-analysis

Year:  2022        PMID: 35885548      PMCID: PMC9320089          DOI: 10.3390/diagnostics12071643

Source DB:  PubMed          Journal:  Diagnostics (Basel)        ISSN: 2075-4418


1. Introduction

Breast cancer is the most commonly diagnosed cancer overall and among women worldwide, and it was the fifth leading cause of cancer-related mortality globally in 2020 [1,2]. The screening and diagnosis of breast cancer rely on multiple assessments, such as breast examination, mammography and biopsy. Different imaging modalities, including mammography, ultrasound (US), magnetic resonance imaging (MRI), histological imaging and infrared thermography, have been used in breast cancer detection. Mammography is the modality most commonly used for breast cancer screening; for example, women aged 40 years and above are recommended to undergo mammographic screening [3,4]. Mammography mainly comprises the digital mammogram and digital breast tomosynthesis (DBT). The digital mammogram is more commonly used for breast cancer detection; however, it is less effective in patients with dense breasts and less sensitive to small tumors (smaller than 1 mm [5]). DBT, or the three-dimensional mammogram, is a more advanced mammographic technology that overcomes these disadvantages and, overall, provides higher diagnostic accuracy than the two-dimensional mammogram [6]. However, no significant difference was noted between the two technologies when used for screening purposes [7].
Machine learning is expected to improve health care, especially in medical specializations such as diagnostic radiology, cardiology, ophthalmology and pathology [8]. Factors such as the availability of big medical data and advances in computing technology will help accelerate the use of machine learning in these areas. However, in spite of these positive developments, the practical implementation of machine learning in clinical settings remains debatable [9,10,11].
Issues such as privacy concerns, lack of trust in the technology, machine learning interpretability and unintended bias are yet to be fully explored [8,12,13,14]. Machine learning has been investigated for use in the field of breast cancer in various ways, such as predicting and screening the disease [15], predicting cancer recurrence [16], predicting patient survival [17], predicting breast density, and guiding treatment and management of the disease [18,19]. Different data sources, such as sociodemographic and clinical data, genomic data and imaging data, coupled with various machine learning techniques, have been explored in clinical settings related to breast cancer. In brief, the use of machine learning in this research area falls mainly into three roles: as a screening, diagnostic or prognostic tool. These different roles affect how a model is built and deployed; however, most studies do not clearly state the role of their machine learning model with regard to the clinical context and its practical application. Machine learning on digital mammograms and tomosynthesis mainly aims to serve as a screening tool or, at most, as a supplemental diagnostic tool for a radiologist. Previous studies of machine learning on medical images associated with breast cancer mostly used digital mammograms [20], while the use of tomosynthesis was less common. A wide variety of machine learning techniques has been applied to these images, resulting in a wide range of diagnostic accuracies. This variability in performance makes it difficult to evaluate the benefit of machine learning tools on mammography and may reduce clinicians' confidence in them.
Therefore, this meta-analysis aims to establish the overall diagnostic accuracy of machine learning models on digital mammograms and tomosynthesis, to assess the factors affecting that diagnostic accuracy and to perform subgroup analyses accordingly.

2. Materials and Methods

2.1. Overview

This study was conducted according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses of diagnostic test accuracy studies (PRISMA-DTA) [21] and Synthesising Evidence from Diagnostic Accuracy Tests (SEDATE) [22] guidelines and recommendations. Both checklists are presented in the Supplementary File.

2.2. Search Strategy

We searched the online databases Scopus, PubMed, Google Scholar, Embase and Web of Science using predetermined search terms. The search was carried out on 17 August 2020 for the Scopus, PubMed and Google Scholar databases and on 25 August 2020 for the Embase and Web of Science databases. All search terms for each database are presented in Supplementary Table S1. All the results were imported into Mendeley, and duplicate papers were automatically screened and deleted. Subsequently, a researcher (TMH) manually screened the results again and deleted the remaining duplicates that Mendeley had not identified. We then divided the screening process into two phases. In the first phase, we applied more lenient selection criteria to screen out articles that were clearly unrelated to our study. The full text of every article that passed the first phase was downloaded. In the second phase, we applied more stringent selection criteria so that the retained articles fit our study’s objectives. Any inconsistency during the selection and extraction process was resolved by discussion and consensus among the researchers.

2.3. Selection Criteria

We divided the screening process into two phases. In the first phase, we mainly screened the titles and abstracts and, if needed, the full text. We searched for the following groups of articles in the first phase: (1) articles related to breast cancer prediction or classification; (2) articles that used machine learning models or algorithms; (3) articles written in English; (4) articles that used digital mammogram or tomosynthesis data; and (5) articles that reported at least an accuracy value as a performance metric; (6) non-peer-reviewed articles, proceedings and theses were excluded. In the second phase of the selection process, we screened all the remaining articles using the full text, based on the following criteria: (1) articles that focused only on breast cancer classification models; articles that compared feature extraction and segmentation methods were excluded. (2) Articles that reported a confusion matrix or at least sufficient data to reconstruct one. (3) Articles with ensemble or hybrid machine learning models as classifiers were excluded. (4) Three-class prediction models were excluded unless a 2 × 2 confusion matrix was reported.

2.4. Data Extraction

We extracted the data from the included articles into a Microsoft Excel spreadsheet. The extracted variables were as follows: (1) title; (2) first author’s last name; (3) year of publication; (4) source of data; (5) country of origin of the data; (6) size of the dataset; (7) training/validation/testing split; (8) type of data; (9) sample size used; (10) classifier; (11) prediction class; (12) accuracy; (13) sensitivity; (14) specificity; and (15) confusion matrix. More than one model was extracted from an article if the models used different data, classifiers or prediction classes; however, where an article reported closely similar models, only the model with the highest accuracy was extracted.

2.5. Quality Assessment

We used the QUADAS-2 tool [23] to assess the quality of the studies included in the meta-analysis. The tool consists of four domains: patient selection, index test, reference standard, and flow and timing. All four domains were assessed for risk of bias, and the first three were also assessed for applicability concerns. The risk of bias for each domain was determined using the signalling questions of the QUADAS-2 tool, each rated ‘yes’, ‘unclear’ or ‘no’. A domain was considered at low risk of bias if all of its signalling questions were rated ‘yes’, and at high risk of bias if one of the signalling questions was rated ‘no’ and none of the remaining questions were rated ‘yes’; otherwise, the domain was considered at unclear risk of bias. Additionally, we added an overall rating to the QUADAS-2 assessment by assigning values of 1, 0 and −1 to low, unclear and high ratings, respectively, across the seven assessed domains. The sum of the overall rating could therefore range from −7 to 7, and the overall quality was classified as very poor (−7 to −4), poor (−3 to 0), moderate (1 to 4) or good (5 to 7).
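As a concrete illustration, the overall-rating rule above can be sketched in Python (the function name and interface are ours, not part of the QUADAS-2 tool or the paper):

```python
def quadas2_overall(ratings):
    """Sum the seven QUADAS-2 domain ratings (4 risk-of-bias domains +
    3 applicability domains) and map the total to the overall quality
    label used in this study."""
    score_map = {"low": 1, "unclear": 0, "high": -1}
    total = sum(score_map[r.lower()] for r in ratings)
    if total >= 5:
        label = "good"          # 5 to 7
    elif total >= 1:
        label = "moderate"      # 1 to 4
    elif total >= -3:
        label = "poor"          # -3 to 0
    else:
        label = "very poor"     # -7 to -4
    return total, label

# All seven domains rated 'low' gives the maximum score of 7:
print(quadas2_overall(["low"] * 7))  # (7, 'good')
```

A study with, say, six ‘low’ ratings and one ‘unclear’ rating scores 6 and is likewise classified as good.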

2.6. Outcomes

The primary outcomes were the overall diagnostic accuracy of the machine learning models in the form of the AUC and the hierarchical summary receiver operating characteristics (HSROC) curve. The secondary outcomes were the results of likelihood ratio tests for the covariates classifier, country of the data, source of data and type of data. Covariates with a p-value < 0.05 were considered statistically significant and were followed up with a post hoc subgroup analysis.

2.7. Statistical Analysis

The statistical analysis was carried out using R version 4.1.0 [24]. The full R code is available on GitHub [25]. The main R packages used were mada and metafor [26,27]. A continuity correction of 0.5 was applied if there were zero cells in the confusion matrix, to avoid statistical artefacts; this is the default setting in the mada package. Each machine learning model was summarized by the pooled diagnostic odds ratio (DOR), sensitivity and specificity. The DOR represents the odds of a positive test result in diseased individuals relative to the odds of a positive result in healthy individuals; thus, the DOR denotes the discriminant ability of the diagnostic test. Sensitivity represents the ability of the test to correctly identify affected individuals, while specificity reflects its ability to correctly identify healthy individuals. The pooled sensitivity, pooled specificity, AUC and HSROC curve parameters were estimated using the bivariate model of Reitsma et al. [28] through the mada package. The bivariate approach provides a better estimate, especially if each machine learning model used a different cut-off threshold to classify positive and negative cases [22]. The 95% confidence interval of the AUC was estimated using a bootstrap method from the dmetatools package [29]. Heterogeneity was assessed through visual inspection of the HSROC plot and the correlation between sensitivity and specificity; inconsistency was suspected if the individual studies deviated largely from the HSROC line and the correlation coefficient between sensitivity and specificity was larger than zero [22,30]. Cochran’s Q test and Higgins’ I2 statistic are not presented, as they are not suitable for heterogeneity assessment in diagnostic test accuracy studies [31].
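The per-model summaries described above reduce to simple arithmetic on the 2 × 2 confusion matrix. The analysis itself was done in R with mada; the following Python sketch is our own translation for illustration:

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity and diagnostic odds ratio (DOR) from a
    2x2 confusion matrix. A continuity correction of 0.5 is added to
    every cell when any cell is zero, mirroring mada's default."""
    if 0 in (tp, fp, fn, tn):
        tp, fp, fn, tn = (x + 0.5 for x in (tp, fp, fn, tn))
    sensitivity = tp / (tp + fn)  # proportion of diseased correctly identified
    specificity = tn / (tn + fp)  # proportion of healthy correctly identified
    dor = (tp * tn) / (fp * fn)   # odds of a positive test in diseased vs. healthy
    return sensitivity, specificity, dor

# Hypothetical model: 90 true positives, 10 false positives,
# 10 false negatives, 90 true negatives:
print(diagnostic_metrics(90, 10, 10, 90))  # (0.9, 0.9, 81.0)
```

With a zero cell (e.g. no false positives), the continuity correction keeps the DOR finite instead of undefined.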
A likelihood ratio test between bivariate meta-regression models was carried out to compare a null model with a model containing a covariate. Five bivariate meta-regression models were built: the null model and models with a covariate of country, source, type of data or classifier. The country covariate indicated the country of origin of the data, while the source covariate indicated whether the data came from a local database (primary data) or an online secondary database. The type of data covariate reflected the type of mammogram image, and the classifier covariate reflected the different machine learning models included in this study. A likelihood ratio test with a p-value < 0.05 indicated that the model with the covariate fitted better; thus, that covariate was considered statistically significant. Subsequently, a post hoc subgroup analysis was performed for each significant covariate. Pairwise comparisons of the AUC between the models of each subgroup were performed using a bootstrap method in the dmetatools package, and p-values were adjusted using the Bonferroni correction: a p-value below 0.05 divided by the number of groups in the subgroup analysis indicated a significant comparison. A non-convergent result indicated that the model did not converge even after 10,000 bootstrap resamples. Any subgroup model covering only a small number of studies was dropped from the subgroup analysis, as its estimates of the AUC and HSROC parameters were not reliable. An influential diagnostic analysis was performed on the overall model using a leave-one-out approach in the dmetatools package to estimate the change in the AUC when each model was omitted. Publication bias was evaluated using Deeks’ regression test [32], which is considered the most appropriate approach for assessing publication bias in diagnostic test accuracy studies [33]; a p-value < 0.10 may indicate the presence of publication bias.
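The likelihood ratio test p-values reported later in Table 2 can be reproduced from the reported χ²-statistics. For even degrees of freedom, the chi-square survival function has a simple closed form, used in this Python sketch (our own check, not part of the original R code):

```python
import math

def chi2_sf_even_df(x, df):
    """Chi-square survival function P(X > x) for even df, via the
    closed-form Erlang tail: exp(-x/2) * sum_{i=0}^{df/2-1} (x/2)^i / i!."""
    if df % 2 != 0 or df < 2:
        raise ValueError("this closed form requires even df >= 2")
    half = x / 2.0
    return math.exp(-half) * sum(half**i / math.factorial(i)
                                 for i in range(df // 2))

# Reproducing the likelihood ratio test p-values from Table 2:
print(round(chi2_sf_even_df(19.55, 6), 3))   # 0.003  (country)
print(round(chi2_sf_even_df(31.10, 12), 3))  # 0.002  (source)
print(round(chi2_sf_even_df(4.23, 2), 3))    # 0.121  (type of data)
print(round(chi2_sf_even_df(30.32, 16), 3))  # 0.016  (classifier)
```

Each likelihood ratio statistic is twice the log-likelihood difference between the covariate model and the null model, compared against a chi-square distribution with the difference in parameters as degrees of freedom.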

3. Results

3.1. Eligible Studies

In total, 2897 research articles were identified in the 5 databases, as presented in Figure 1. After the removal of 1115 duplicates, the remaining 1782 articles were included in the screening process. A total of 1346 articles were excluded during the whole screening process. The first screening process excluded 1157 articles, while the second screening process excluded another 189 papers. Finally, 36 studies containing 68 machine learning models were included in this study.
Figure 1

Flow diagram of the study selection process.

3.2. Study Characteristics

The main characteristics of the included studies are presented in Table 1. The years of publication of the 36 included studies ranged from 2006 to 2020. Eleven studies used primary data from their respective countries, while most studies used secondary databases, such as the Mammographic Image Analysis Society (MIAS), mini-MIAS and Digital Database for Screening Mammography (DDSM). Only one study used tomosynthesis images, while the remaining thirty-five used digital mammogram images. The three most common classifiers were neural network (23.5%), support vector machine (22.1%) and deep learning (20.6%).
Table 1

Characteristics of included studies.

Study | ID | Country | Source | Size of Dataset | Train/Validation/Test Split | Type of Data | Classifier | Prediction Class | TP | TN | FP | FN | Accuracy
Abdolmaleki 2006 [34]1IranPrimary data122 cases82/-/40DMNNBenign-Malignant 1614820.75
Acharyau 2008 [35]2USADDSM360 images270/-/90DMNNNormal-Benign-Malignant5528250.97
3USADDSM360 images270/-/90DMGMMNormal-Benign-Malignant5729130.98
Al-antari 2020 [36]4USADDSM600 images420/60/120DMDLBenign-Malignant 5959110.98
5PortugalINbreast410 images78/12/22DMDLBenign-Malignant 146200.95
Alfifi 2020 [37]6UKMIAS200 imagesNEDMDLNormal-Benign-Malignant12466730.95
7UKMIAS200 imagesNEDMTree-basedNormal-Benign-Malignant1025429150.78
8UKMIAS200 imagesNEDMKNNNormal-Benign-Malignant995032190.74
Al-hiary 2012 [38]9JordanPrimary dataNENEDMNNNormal-Cancer1415120.91
Al-masni 2018 [39]10USADDSM2400 images1920/-/480DMNNBenign-Malignant 2402261400.97
Bandeira-diniz 2018 [40]11USADDSM2482 images1990/-/492DMDLNon-mass-Mass241843064422250.91
12USADDSM2482 images1990/-/492DMDLNon-mass-Mass177456152101880.95
Barkana 2017 [41]13USADDSM2173 images1451/-/722DMNNBenign-Malignant 32527070570.82
14USADDSM2173 images1451/-/722DMSVMBenign-Malignant 31827862640.83
Biswas 2019 [42]15UKMIAS322 images226/48/48DMNNNormal-Abnormal3212310.92
Cai 2019 [43]16ChinaPrimary data990 images891/-/99DMSVMBenign-Malignant 4839660.89
Chen 2019a [44]17ChinaPrimary data81 casesNEDMTree-basedBenign-Malignant 31301190.75
Chen 2019b [45]18USAPrimary data275 cases10-folds cross validationDMSVMBenign-Malignant 10210437320.75
19USAPrimary data275 cases10-folds cross validationDMSVMBenign-Malignant 10311427310.79
Danala 2018 [46]20USAPrimary data111 casesLOO-CVDMDLBenign-Malignant 63249150.78
21USAPrimary data111 casesLOO-CVDMDLBenign-Malignant 552112230.68
Daniellopez-cabrera 2020 [47]22UKmini-MIAS322 imagesNEDMDLNormal-Abnormal31101240.97
23UKmini-MIAS322 imagesNEDMDLBenign-Malignant 1428310.91
Fathy 2019 [48]24USADDSM3932 images2517/629/786DMDLNormal-Abnormal3893257110.91
Girija 2019 [49]25UKmini-MIAS322 imagesNEDMTree-basedNormal-Abnormal26648440.98
26UKmini-MIAS322 imagesNEDMTree-basedBenign-Malignant 20055690.94
Jebamony 2020 [50]27UKmini-MIAS294 images203/-/91DMNNBenign-Malignant 33411250.85
28UKmini-MIAS294 images203/-/91DMSVMBenign-Malignant 3749410.96
Junior 2010 [51]29UKmini-MIAS428 ROIs320/-/108DMNNNormal-Abnormal16695180.79
30UKmini-MIAS428 ROIs320/-/108DMSVMNormal-Abnormal2080170.93
Kanchanamani 2016 [52]31UKMIAS322 imagesNEDMSVMNormal-Abnormal461202400.87
32UKMIAS322 imagesNEDMBayes-basedNormal-Abnormal309450160.65
33UKMIAS322 imagesNEDMDLNormal-Abnormal2310143230.65
34UKMIAS322 imagesNEDMKNNNormal-Abnormal2811232180.74
35UKMIAS322 imagesNEDMLDANormal-Abnormal2811232180.74
36UKMIAS322 imagesNEDMSVMBenign-Malignant 5853270.93
37UKMIAS322 imagesNEDMBayes-basedBenign-Malignant 502035150.58
38UKMIAS322 imagesNEDMDLBenign-Malignant 292926360.48
39UKMIAS322 imagesNEDMKNNBenign-Malignant 412530240.55
40UKMIAS322 imagesNEDMLDABenign-Malignant 383322270.59
Kim 2018 [53]41KoreaPrimary data29,107 images26631/1238/1238DMDLNormal-Abnormal471548711480.82
Mao 2019 [54]42ChinaPrimary data173 cases138/-/35DMSVMBenign-Malignant 1314170.80
43ChinaPrimary data173 cases138/-/35DMLogisticBenign-Malignant 1714130.89
44ChinaPrimary data173 cases138/-/35DMKNNBenign-Malignant 8141120.83
45ChinaPrimary data173 cases138/-/35DMBayes-basedBenign-Malignant 9132110.78
Miao 2015 [55]46USAMMD830 cases10-folds cross validationDMSVMBenign-Malignant 38139928220.94
Miao 2013 [56]47USAMMD830 casesNEDMNNBenign-Malignant 36038443430.90
Milosevic 2015 [57]48UKMIAS300 images5-folds cross validationDMSVMNormal-Abnormal2316324900.62
49UKMIAS300 images5-folds cross validationDMKNNNormal-Abnormal4413849690.61
50UKMIAS300 images5-folds cross validationDMBayes-basedNormal-Abnormal5311374600.55
51SerbiaPrimary data300 images5-folds cross validationDMSVMNormal-Abnormal12113020290.84
52SerbiaPrimary data300 images5-folds cross validationDMKNNNormal-Abnormal847971660.54
53SerbiaPrimary data300 images5-folds cross validationDMBayes-basedNormal-Abnormal11411832360.77
Nithya 2012 [58]54USADDSM250 images200/-/50DMNNNormal-Abnormal2324210.94
Nusantara 2016 [59]55UKMIAS322 images291/-/31DMKNNNormal-Abnormal1020010.97
Palantei 2017 [60]56UKMIASNENEDMSVMNormal-Abnormal921400.88
Paramkusham 2018 [61]57USADDSM148 images126/-/22DMSVMBenign-Malignant 1010110.91
Roseline 2018 [62]58UKMIASNENEDMKNNBenign-Malignant 4960420.95
Shah 2015 [63]59UKMIAS320 imagesNEDMNNNormal-Abnormal5449230.95
60UKMIAS320 imagesNEDMNNBenign-Malignant 2422260.85
Shivhare 2020 [64]61USA, UKDDSM, MIASNENEDMNNBenign-Malignant 1216230.85
62USA, UKDDSM, MIASNENEDMDLBenign-Malignant 1171140.55
63USA, UKDDSM, MIASNENEDMSVMBenign-Malignant 0180150.55
Singh 2018 [65]64UKMIAS139 ROIs69/28/42DMNNBenign-Malignant 2514120.93
Venkata 2019 [66]65NANA110 images80/-/30DMLogistic regressionBenign-Malignant 1414110.93
Wang 2017 [67]66UKmini-MIAS200 images10-folds cross validationDMNNNormal-Abnormal9292880.92
Wutsqa 2017 [68]67UKMIAS120 cases96/-/24DMNNNormal-Abnormal148020.92
Yousefi 2018 [69]68USAPrimary data87 imagesNETomosynthesisTree-basedBenign-Malignant 1113220.87

DM = digital mammogram; NN = neural network; GMM = Gaussian mixture model; DL = deep learning; KNN = k-nearest neighbor; SVM = support vector machine; LDA = linear discriminant analysis; ROIs = regions of interest; LOO-CV = leave-one-out cross-validation; NE = not clearly explained; NA = not available; TP = true positive; TN = true negative; FP = false positive; FN = false negative; DDSM = Digital Database for Screening Mammography; MIAS = Mammographic Image Analysis Society; MMD = mammographic mass database.

3.3. Descriptive Statistics

The study with the highest accuracy was that carried out by Acharya U et al. in 2008 (98.3%), while the study by Kanchanamani et al. in 2016 had the lowest accuracy (48.3%). The specificity and sensitivity values of each machine learning model are presented in Figure 2. Sensitivity values for the machine learning models in this study ranged between 0.03 (95% CI: 0.00–0.24) and 1.00 (95% CI: 0.98–1.00), while specificity values ranged between 0.37 (95% CI: 0.25–0.50) and 0.98 (95% CI: 0.93–1.00). Significant differences were observed between the sensitivity values (p < 0.001) and between the specificity values (p < 0.001) of the machine learning models. The pooled DOR of the machine learning models was 28.34 (95% CI: 17.67–45.45), with the DOR of each model ranging from 0.90 (95% CI: 0.44–1.84) to 7513.55 (95% CI: 445.61–126,689.03). Figure 3 presents the DOR values for each machine learning model in this study.
Figure 2

Sensitivity and specificity of machine learning models in the study.

Figure 3

The diagnostic odds ratio of machine learning models in the study.

3.4. Overall Model

The pooled area under the curve (AUC) estimated using the bivariate model of Reitsma et al. [28] for the overall machine learning models in this study was 0.90 (95% CI: 0.85–0.90). The HSROC curve plot is presented in Figure 4. Additionally, the pooled sensitivity and pooled specificity values estimated through the same model were 0.83 (95% CI: 0.78–0.87) and 0.84 (95% CI: 0.81–0.87), respectively.
Figure 4

Hierarchical summary receiver operating characteristics (HSROC) curve for overall machine learning models in the study.

3.5. Test for Heterogeneity and Influential Diagnostics

Based on the HSROC curve plot (Figure 4), there was a moderate deviation of the individual models from the curve. The correlation coefficient of the sensitivity and specificity was 0.33. Thus, there was an indication of slight-to-moderate heterogeneity in this study. However, the influential diagnostics indicated that there was no influential model in the study. The result of the influential diagnostics is presented in Supplementary Table S2.

3.6. Subgroup Analysis

Three out of four covariates were found to be significant via the likelihood ratio test: country (p = 0.003), source (p = 0.002) and classifier (p = 0.016), while the type of data was not significant (p = 0.121). The detailed results of the likelihood ratio tests are presented in Table 2. Thus, country, source and classifier explain some of the heterogeneity observed in the study. A further subgroup analysis was performed on the three significant covariates. All countries other than the USA and the UK were combined into one group due to the small number of available studies. Studies that used data from both the USA and the UK were excluded, as they were few and did not fit into any other group. Pairwise post hoc comparison of the country subgroup revealed that machine learning models using data from the USA performed better, in terms of AUC, than models using data from the other countries (dAUC = 0.10, 95% CI: 0.04–0.19). In the subgroup analysis of the classifier covariate, three classifiers were dropped due to the small number of studies: the Gaussian mixture model (GMM), linear discriminant analysis (LDA) and logistic regression. The three significant pairwise comparisons in this subgroup analysis were neural network vs. Bayes-based model (dAUC = 0.25, 95% CI: 0.12–0.38), tree-based vs. Bayes-based model (dAUC = 0.25, 95% CI: 0.07–0.40) and support vector machine vs. Bayes-based model (dAUC = 0.22, 95% CI: 0.09–0.35); all three remained significant after the Bonferroni correction. Lastly, for the subgroup analysis of the source covariate, we dropped studies that used the INbreast database or the mammographic mass database (MMD), studies that used both the DDSM and MIAS databases, and studies with unknown sources of data; studies that used the MIAS and mini-MIAS databases were combined into a single group. No pairwise comparison of the AUC was significant in this source subgroup analysis, and there were six non-convergent pairwise comparisons overall. The complete pairwise comparisons for all three subgroups are presented in Table 3, while Figure 5 delineates the HSROC curves for the subgroups. The highest AUCs in each subgroup were obtained by models using US data (AUC = 0.94), models using the DDSM database (AUC = 0.97) and neural network models (AUC = 0.94). As shown in Figure 5, models that used the DDSM database appear to perform better than models that used primary data, while the other model comparisons were broadly consistent with Table 3.
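The Bonferroni decision rule applied to these comparisons (significance after correction requires p below 0.05 divided by the number of groups in the subgroup, as described in the Methods) is trivial to state in code; a Python sketch:

```python
def significant_after_bonferroni(p, n_groups, alpha=0.05):
    """Decision rule from the Methods: a pairwise comparison is
    significant after Bonferroni correction if its p-value is below
    alpha divided by the number of groups in the subgroup analysis."""
    return p < alpha / n_groups

# Country subgroup has 3 groups (USA, UK, others), so the corrected
# threshold is 0.05 / 3 ~= 0.0167:
print(significant_after_bonferroni(0.001, 3))  # True  (USA vs. others)
print(significant_after_bonferroni(0.035, 3))  # False (USA vs. UK)
```

This matches Table 3, where USA vs. others carries the Bonferroni marker while USA vs. UK is significant only at the uncorrected 0.05 level.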
Table 2

A likelihood ratio test for bivariate meta-regression models with the null model.

Model | Covariate | χ²-Statistic (df) | p-Value
Model 1 | Country | 19.55 (6) | 0.003 *
Model 2 | Source | 31.10 (12) | 0.002 *
Model 3 | Type of data | 4.23 (2) | 0.121
Model 4 | Classifier | 30.32 (16) | 0.016 *

* Significance at p < 0.05.

Table 3

A post hoc pairwise comparison for covariates country, source of data and classifier.

Comparisons | dAUC (95% CI) | p-Value

Country
USA vs. UK | 0.051 (0.006, 0.127) | 0.035 *
USA vs. others 1 | 0.095 (0.044, 0.191) | 0.001 **
UK vs. others 1 | 0.044 (−0.034, 0.131) | 0.241

Source of data
Primary data vs. DDSM | † | †
Primary data vs. MIAS 2 | −0.062 (−0.127, 0.023) | 0.152
DDSM vs. MIAS 2 | † | †

Classifier
NN vs. DL | † | †
NN vs. Tree-based | 0.003 (−0.071, 0.138) | 0.946
NN vs. KNN | 0.157 (0.026, 0.325) | 0.010
NN vs. SVM | 0.033 (−0.034, 0.074) | 0.337
NN vs. Bayes-based | 0.252 (0.119, 0.379) | <0.001 **
DL vs. Tree-based | −0.016 (−0.122, 0.117) | 0.690
DL vs. KNN | † | †
DL vs. SVM | † | †
DL vs. Bayes-based | † | †
Tree-based vs. KNN | 0.153 (−0.023, 0.333) | 0.082
Tree-based vs. SVM | 0.030 (−0.101, 0.099) | 0.578
Tree-based vs. Bayes-based | 0.249 (0.073, 0.395) | 0.007 **
KNN vs. SVM | −0.123 (−0.300, −0.004) | 0.044 *
KNN vs. Bayes-based | 0.096 (−0.121, 0.265) | 0.404
SVM vs. Bayes-based | 0.219 (0.094, 0.350) | <0.001 **

* Significance at p < 0.05; ** significance after Bonferroni correction; † non-convergence; 1 others: Iran, Portugal, Jordan, China, Korea and Serbia; 2 mini-MIAS and MIAS databases were combined into a group; dAUC = difference of the area under the curve; DDSM = Digital Database for Screening Mammography; MIAS = Mammographic Image Analysis Society; NN = neural network; DL = deep learning; KNN = k-nearest neighbor; SVM = support vector machine.

Figure 5

Hierarchical summary receiver operating characteristics (HSROC) curve for each subgroup analysis in the study.

3.7. Publication Bias

Deeks’ regression test was performed on the overall model, which included all 68 machine learning models from the 36 studies. The test indicated the possibility of publication bias in this study (p = 0.002), and Figure 6 shows that Deeks’ funnel plot was asymmetrical.
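The test was run in R; its core, a regression of the log diagnostic odds ratio against the inverse square root of the effective sample size, weighted by that sample size, can be sketched in Python as follows. This is a simplified illustration, not the original implementation: the t-test on the slope, which produces the p-value, is omitted here, and a 0.5 correction is applied to all cells for simplicity.

```python
import math

def deeks_regression(tp, fp, fn, tn):
    """Weighted linear regression of ln(DOR) on 1/sqrt(ESS), weighted
    by the effective sample size ESS = 4*n1*n2/(n1+n2), where n1 and
    n2 are the diseased and healthy counts. A slope far from zero
    suggests funnel-plot asymmetry, i.e. possible publication bias."""
    xs, ys, ws = [], [], []
    for a, b, c, d in zip(tp, fp, fn, tn):
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
        n_dis, n_healthy = a + c, b + d
        ess = 4 * n_dis * n_healthy / (n_dis + n_healthy)
        ys.append(math.log(a * d / (b * c)))  # ln(diagnostic odds ratio)
        xs.append(1 / math.sqrt(ess))
        ws.append(ess)
    w_sum = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / w_sum
    y_bar = sum(w * y for w, y in zip(ws, ys)) / w_sum
    slope = (sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
             / sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs)))
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Three hypothetical studies (TP, FP, FN, TN per study):
intercept, slope = deeks_regression([10, 40, 80], [2, 8, 10],
                                    [2, 10, 15], [20, 60, 100])
```

In the full test, a t-test on the slope yields the p-value, and p < 0.10 is conventionally taken to indicate asymmetry.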
Figure 6

Deeks’ funnel plot.

3.8. Quality Assessment

Table 4 shows the quality assessment of the 36 included studies using the updated Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool. Generally, the majority of studies had an unclear risk of bias and low applicability concerns. Additionally, several studies with a high risk of bias were observed under the subdomains of ‘patient selection’ and ‘flow and timing’ of the risk of bias domain. Most studies used secondary databases and did not explain in detail the data selection process and flow of their studies. Items such as the consecutive or random sampling approach, inappropriate exclusion of the data and the proper interval between the index test and the reference standard were not clearly addressed in most of the included studies. Overall, out of the 36 studies included in the meta-analysis, 2 studies were found to be of poor quality, 9 studies of good quality and 25 studies of moderate quality.
Table 4

Quality assessment of the included studies according to the QUADAS-2 tool.

Study | Risk of Bias (Patient Selection / Index Test / Reference Standard / Flow and Timing) | Applicability (Patient Selection / Index Test / Reference Standard) | Overall
Abdolmaleki 2006 | Low / Unclear / Low / Low | Low / Low / Low | Good
Acharyau 2008 | High / Unclear / Low / Unclear | Low / Low / Low | Good
Al-antari 2020 | Low / Unclear / Unclear / Low | Unclear / Low / Unclear | Moderate
Alfifi 2020 | Unclear / Unclear / Unclear / Unclear | Low / Low / Unclear | Moderate
Al-hiary 2012 | High / Low / Unclear / Unclear | Unclear / Low / Unclear | Moderate
Al-masni 2018 | Low / Unclear / Low / Unclear | Low / Low / Low | Moderate
Bandeira-diniz 2018 | High / Low / Low / Unclear | Low / Low / Low | Good
Barkana 2017 | Unclear / Unclear / Low / Unclear | Unclear / Low / Low | Moderate
Biswas 2019 | Unclear / Unclear / Unclear / Unclear | Unclear / Low / Unclear | Moderate
Cai 2019 | Low / Low / Low / Low | Low / Low / Low | Moderate
Chen 2019a | Low / Unclear / Low / Low | Low / Low / Low | Moderate
Chen 2019b | Low / Low / Low / Low | Low / Low / Low | Good
Danala 2018 | Low / Low / Low / Low | Low / Low / Low | Good
Daniellopez-cabrera 2020 | Unclear / Unclear / Unclear / Unclear | Low / Low / Unclear | Good
Fathy 2019 | High / Low / Low / Unclear | Low / Low / Low | Poor
Girija 2019 | Unclear / Low / Unclear / Unclear | Low / Low / Low | Good
Jebamony 2020 | Unclear / Unclear / Unclear / High | Low / Low / Unclear | Moderate
Junior 2010 | High / Unclear / Unclear / High | Low / Low / Unclear | Moderate
Kanchanamani 2016 | Unclear / Unclear / Unclear / Unclear | Low / Low / Unclear | Moderate
Kim 2018 | Unclear / Low / Low / Low | Low / Low / Low | Moderate
Mao 2019 | Low / Unclear / Low / Low | Low / Low / Low | Moderate
Miao 2015 | Unclear / Unclear / Unclear / High | Low / Low / Unclear | Moderate
Miao 2013 | Low / Low / Unclear / High | Low / Low / Unclear | Moderate
Milosevic 2015 | Low / Unclear / Unclear / Unclear | Low / Low / Unclear | Moderate
Nithya 2012 | Unclear / Unclear / Low / Unclear | Low / Low / Low | Moderate
Nusantara 2016 | Unclear / Low / Unclear / Unclear | Low / Low / Low | Moderate
Palantei 2017 | High / Unclear / Unclear / Unclear | Low / Low / Unclear | Poor
Paramkusham 2018 | Unclear / Unclear / Low / Unclear | Low / Low / Low | Moderate
Roseline 2018 | Unclear / Unclear / Unclear / High | Low / Low / Unclear | Moderate
Shah 2015 | Unclear / Unclear / Unclear / Unclear | Low / Low / Unclear | Good
Shivhare 2020 | Unclear / Unclear / Unclear / High | Low / Low / Unclear | Good
Singh 2018 | Unclear / Unclear / Low / Low | Low / Low / Low | Moderate
Venkata 2019 | Unclear / Unclear / Unclear / Unclear | Unclear / Low / Unclear | Moderate
Wang 2017 | High / Unclear / Unclear / Unclear | Low / Low / Unclear | Moderate
Wutsqa 2017 | High / Unclear / Unclear / Unclear | Low / Low / Unclear | Moderate
Yousefi 2018 | Unclear / Unclear / Low / Unclear | Low / Low / Low | Moderate

4. Discussion

This study assessed the diagnostic accuracy of machine learning models on digital mammograms and tomosynthesis. According to our findings, machine learning models performed well in breast cancer classification using digital mammograms and tomosynthesis, with a pooled AUC of 0.90. A previous meta-analysis that analyzed different machine learning algorithms to estimate breast cancer risk was published in 2018 [70]. However, that study did not include deep learning methods and presented only a summarized result across all machine learning methods. Another meta-analysis focusing on deep learning reported good diagnostic accuracy for breast cancer detection using mammograms, US, MRI and DBT, with pooled AUCs of 0.87, 0.91, 0.87 and 0.91, respectively [71]. Meanwhile, several meta-analyses that assessed the diagnostic accuracy of machine learning models on MRI in gliomas, prostate cancer and meningioma reported slightly lower AUCs of 0.88, 0.86 and 0.75, respectively [72,73,74]. Our study included all previous studies that applied any machine learning algorithm to mammography for breast cancer detection. In brief, our findings support the promising potential of machine learning on mammographic data for breast cancer detection in clinical settings, especially as a screening tool and as a supplementary diagnostic tool for radiologists. Inconsistency among diagnostic accuracy studies is to be expected [22]. In this meta-analysis, the three covariates that may explain the inconsistency among the studies were country, source and classifier. In terms of country, studies that used data from the USA and the UK had higher AUCs than studies from the other countries (others group); however, only the pairwise comparison between the USA and the other countries reached statistical significance. This result may indicate differences in the characteristics of patients with breast cancer across countries.
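The pooled estimates above come from a bivariate Reitsma model, which jointly models logit-sensitivity and logit-specificity across studies. As a much-simplified illustration of the underlying pooling idea (a univariate fixed-effect sketch in Python, not the authors' actual method; the study counts below are hypothetical), logit-transformed sensitivities can be combined with inverse-variance weights:

```python
import math

def pooled_logit_sensitivity(studies):
    """Fixed-effect inverse-variance pooling of logit-sensitivities.

    studies: list of (true_positives, false_negatives) per study.
    Returns the pooled sensitivity back on the original 0-1 scale.
    """
    num, den = 0.0, 0.0
    for tp, fn in studies:
        # Continuity-corrected sensitivity and its logit transform
        sens = (tp + 0.5) / (tp + fn + 1.0)
        logit = math.log(sens / (1.0 - sens))
        # Approximate variance of the logit from the 2x2 counts
        var = 1.0 / (tp + 0.5) + 1.0 / (fn + 0.5)
        num += logit / var
        den += 1.0 / var
    pooled_logit = num / den
    return 1.0 / (1.0 + math.exp(-pooled_logit))

# Hypothetical counts from three studies; pooled value lies between
# the individual study sensitivities, pulled toward the larger studies.
print(round(pooled_logit_sensitivity([(80, 20), (45, 5), (120, 30)]), 3))
```

The full bivariate model additionally accounts for between-study heterogeneity and the correlation between sensitivity and specificity, which this sketch omits.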
For example, breast cancer presentation and breast density have been reported to vary across populations [75,76], which, in turn, could affect the diagnostic accuracy of machine learning models. Additionally, this study found that studies that used primary data had lower AUCs than studies that used secondary databases. Studies that used primary data may better reflect the actual diagnostic accuracy of machine learning models in real practice, as the data were collected specifically for the studies in question. Lastly, this study found that the classifier with the best AUC was the neural network, followed by the tree-based classifier and deep learning. However, the confidence regions of these three models overlapped with each other (Figure 5), indicating that none of them significantly outperformed the others in breast cancer classification. It is worth noting that the Bayes-based machine learning model had the lowest AUC (0.69) and performed significantly worse than the neural network, tree-based model and support vector machine. Nevertheless, a few studies were dropped in each subgroup analysis because of the small number of studies in particular groups, which limited the pairwise comparisons that could be performed. In brief, the subgroup analysis in this study showed that most machine learning models, such as the neural network (AUC = 0.938), deep learning (AUC = 0.918), tree-based models (AUC = 0.934) and SVM (AUC = 0.904), perform well with mammographic data for breast cancer detection. Additionally, future studies should note that the characteristics and quality of the mammographic data influence the performance of machine learning for breast cancer detection. Despite the good performance of machine learning on mammography for breast cancer detection, several considerations should be noted.
Only 31% of the studies included in this meta-analysis used primary data collected by the researchers themselves, while the remaining 69% used publicly available datasets, such as MIAS, mini-MIAS and DDSM. Thus, future studies should focus on high-quality data collected from hospitals or research centers, covering a wide range of women with varying clinical symptoms of breast cancer. Furthermore, future studies should explicitly elucidate the intended role of the machine learning tools they develop, whether as screening, diagnostic or prognostic tools, since these roles have different clinical implications for implementation. For example, machine learning screening tools should aim to reduce false-negative cases: misclassifying a case with a high probability of breast cancer as normal can be a fatal error. Machine learning diagnostic tools, in contrast, should aim to reduce false-positive cases: misdiagnosing a normal case as breast cancer leads to unnecessary procedures, especially invasive ones such as biopsy. Being transparent about where a machine learning tool fits in the clinical pathway of the disease increases clinicians' confidence in utilizing it in the clinical setting. Nonetheless, there are many opportunities and benefits in implementing machine learning for breast cancer detection using mammographic data. Machine learning can reduce the workload of clinicians and accelerate the diagnostic workflow, so that breast cancer patients receive treatment earlier, further reducing the mortality rate of the disease. In this study, we established the good performance of machine learning models on mammography in the classification of breast cancer. We used the bivariate model to estimate the AUC and applied a bootstrap method to estimate its confidence interval.
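The screening-versus-diagnosis distinction above comes down to where a tool's decision threshold is placed. A minimal sketch, using hypothetical model scores, of how moving the threshold trades false negatives against false positives:

```python
def sens_spec(scores_pos, scores_neg, threshold):
    """Sensitivity and specificity when scores >= threshold are called positive.

    scores_pos: model scores for true breast-cancer cases.
    scores_neg: model scores for true normal cases.
    """
    tp = sum(s >= threshold for s in scores_pos)
    fn = len(scores_pos) - tp
    tn = sum(s < threshold for s in scores_neg)
    fp = len(scores_neg) - tn
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical model scores for five cancer cases and five normal cases
cancer = [0.95, 0.85, 0.70, 0.55, 0.40]
normal = [0.60, 0.45, 0.30, 0.20, 0.10]

# Screening setting: a low threshold minimises false negatives
print(sens_spec(cancer, normal, 0.35))  # → (1.0, 0.6): high sensitivity, lower specificity
# Diagnostic setting: a high threshold minimises false positives
print(sens_spec(cancer, normal, 0.65))  # → (0.6, 1.0): lower sensitivity, high specificity
```

The same model can therefore serve either role; what changes is the operating point chosen for the clinical pathway.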
Furthermore, our meta-analysis included a reasonable number of studies, providing relatively reliable results on the primary and secondary outcomes. However, our study had several limitations. Firstly, we found a potential publication bias; one probable cause is unpublished studies with low-performing models. Additionally, the overall model in this study had a moderate amount of heterogeneity, and the considerable number of included studies may contribute both to the occurrence of publication bias and to the high statistical power of the asymmetry test. As shown in Figure 6, model 10 had a much higher DOR than the other models on the right side of the figure; however, removing this model did not have a significant impact on the AUC (Supplementary Table S2). Nonetheless, the mechanism of publication bias in diagnostic accuracy studies remains unclear, and a robust assessment of this bias is yet to be proposed [33]. Future meta-analyses may consider including preprint articles, which may reduce publication bias. Secondly, only one included study used tomosynthesis, while the rest used digital mammograms. Thus, our findings lean more toward digital mammograms than tomosynthesis, although both are considered mammography technologies. In addition, we limited the included studies to those published in English, which may have increased the risk of bias in our findings. Lastly, there is a wide variety of machine learning models with different variants and parameters. Our study was therefore unable to compare each model variant, owing to the small number of studies available for any particular variant.
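Deeks' test, used above to detect publication bias, regresses the log diagnostic odds ratio (DOR) against the inverse square root of the effective sample size and asks whether the slope differs from zero. A simplified, unweighted sketch of that regression in Python (the actual test uses a weighted regression and a significance test on the slope; the 2x2 tables below are hypothetical):

```python
import math

def log_dor(tp, fp, fn, tn):
    """Natural log of the diagnostic odds ratio, with a 0.5 continuity correction."""
    return math.log(((tp + 0.5) * (tn + 0.5)) / ((fp + 0.5) * (fn + 0.5)))

def deeks_slope(tables):
    """Unweighted OLS slope of ln(DOR) on 1/sqrt(effective sample size).

    A symmetric funnel gives a slope near zero; a large slope suggests
    small-study effects (publication bias). tables: (tp, fp, fn, tn) per study.
    """
    xs, ys = [], []
    for tp, fp, fn, tn in tables:
        n_pos, n_neg = tp + fn, fp + tn
        ess = 4 * n_pos * n_neg / (n_pos + n_neg)  # Deeks' effective sample size
        xs.append(1.0 / math.sqrt(ess))
        ys.append(log_dor(tp, fp, fn, tn))
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs)

# Hypothetical 2x2 tables (tp, fp, fn, tn) from four studies
tables = [(80, 10, 20, 90), (40, 8, 10, 42), (120, 15, 30, 135), (20, 5, 6, 19)]
print(round(deeks_slope(tables), 2))
```

In the published test, the p-value attached to this slope is what flags asymmetry, as with the p = 0.002 reported in this meta-analysis.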

5. Conclusions

In conclusion, the performance of machine learning on mammography in breast cancer classification showed promising results, with good sensitivity and specificity values. However, the role of any machine learning technique in the diagnostic pathway should be clearly explained in a diagnostic accuracy study so that the technique can be efficiently incorporated into the clinical setting. In this way, the limitations of each machine learning model will be apparent to clinicians and other health personnel.

