Machine Learning Models to Improve the Differentiation Between Benign and Malignant Breast Lesions on Ultrasound: A Multicenter External Validation Study.

Ling Huo¹, Yao Tan², Shu Wang³, Cuizhi Geng⁴, Yi Li⁵, XiangJun Ma⁶, Bin Wang², YingJian He¹, Chen Yao²,⁷, Tao Ouyang¹.

Abstract

PURPOSE: This study aimed to establish and evaluate the usefulness of a simple, practical, and easy-to-promote machine learning model based on ultrasound imaging features for diagnosing breast cancer (BC).
MATERIALS AND METHODS: Logistic regression, random forest, extra trees, support vector, multilayer perceptron, and XG Boost models were developed. The modeling data set of 1345 cases was from a tertiary class A hospital in China. The external validation data set of 1965 cases was from 3 tertiary class A hospitals and 2 primary hospitals. The area under the receiver operating characteristic curve (AUC) was used as the main evaluation index, and pathological biopsy was used as the gold standard for evaluating each model. Diagnostic capability was also compared with that of clinicians.
RESULTS: Among the six models, the logistic model showed superior diagnostic efficiency, with an AUC of 0.771 and 0.906 and Brier scores of 0.181 and 0.165 in the test and validation sets, respectively. The AUCs of the clinician diagnosis and the logistic model were 0.913 and 0.906. Their AUCs in the tertiary class A hospitals were 0.915 and 0.915, respectively, and were 0.894 and 0.873 in primary hospitals, respectively.
CONCLUSION: The externally validated logistic model can be used to distinguish between malignant and benign breast lesions in ultrasound images. Compared with clinician diagnosis, the logistic model has better diagnostic efficiency, making it potentially useful to assist in screening, particularly in lower level medical institutions.
TRIAL REGISTRATION: http://www.clinicaltrials.gov. ClinicalTrials.gov ID: NCT03080623.

Keywords: breast cancer; diagnostic accuracy; machine learning; patient stratification; screening modalities; ultrasound imaging

Year: 2021 PMID: 33889025 PMCID: PMC8057795 DOI: 10.2147/CMAR.S297794

Source DB: PubMed Journal: Cancer Manag Res ISSN: 1179-1322 Impact factor: 3.989

Introduction

Breast cancer (BC) is the most common malignancy among women worldwide.1 However, most BC patients in China are diagnosed at the advanced stage.2 BC screening for early diagnosis is crucial for improving treatment efficacy and survival.3 BC screening currently includes breast self-examination, mammography, ultrasonography, exfoliative cytology, carcinoembryonic antigen, and carbohydrate antigen 15-3 tests.4 However, these traditional methods have limited application value in early diagnosis due to their lack of sensitivity and/or specificity. The emergence of new biomarkers, such as microRNAs,5–9 lipocalin-1,10 APC gene promoter aberrant methylation,11 14-3-3 sigma (σ) promoter methylation,12 and circulating tumor DNA, makes early BC screening promising.13,14 A recent study on the diagnostic accuracy of seven BC markers found that miRNA has better diagnostic accuracy than do other markers.5 However, although liquid biopsy for tracking new markers is promising, it is not suitable for large-scale screening in areas with scarce medical resources because of its invasiveness and cost. In Beijing, the primary method of BC screening is breast ultrasound imaging examination. However, given that the accuracy of conventional ultrasound imaging is highly dependent on the clinicians’ expertise and experience, the results of BC screening and diagnosis in primary hospitals are suboptimal. In oncology, machine learning models play an important role in developing new auxiliary tools for clinicians.15–17 Therefore, a model for diagnosing breast lesions based on the characteristics of large samples of ultrasound images may be helpful for lowering subjectivity and improving the accuracy of screening.
Computer-aided recognition methods based on technologies such as image segmentation and machine learning have been found to improve the diagnosis of BC.18–23 However, these advanced auxiliary screening technologies and the use of artificial intelligence medical ultrasound equipment are still in the early phase of development. This study aimed to establish a simple, practical, and easy-to-promote clinical model for BC diagnosis and evaluate its usefulness in primary hospitals. Towards this goal, we screened out meaningful predictors based on the data collected by tertiary class A hospitals and established diagnostic models. Population data, including from primary hospitals, were used as an external verification data set to validate the effectiveness of the model and explore its applicability and clinical potential. We ultimately aimed to extend the BC screening experience of skilled clinicians to lower level medical institutions in the form of predictive models, so as to improve the overall quality of screening across the country.

Patients and Methods

Data Sets

The modeling data set was a cumulative collection of data from 1345 patients admitted to a tertiary class A hospital (Beijing Cancer Hospital) between November 2010 and May 2016. We used automated breast ultrasound screening (ABUS) in this study. Data on ultrasound findings and histopathological diagnosis were collected. For early tumor detection, we selected T1 BC patients. T1 was defined as tumor lesions smaller than 2 cm, and thus the maximum diameter of the ultrasound image of the lesion was set to be less than 2 cm. Two-dimensional images were collected, and the coronal image was reconstructed. After re-evaluation by professional clinicians from Beijing People’s Hospital, the cases with consistent findings were selected as the final modeling data set. In total, data from 1125 patients were included; of them, 732 patients had malignant tumors. Given that our model was intended to assist clinicians in primary hospitals in tumor screening, we included some primary hospitals in the selection of the external validation set to test the generalizability of the model. The external validation data set was from 3 tertiary class A hospitals (Beijing Cancer Hospital, Beijing People’s Hospital, Fourth Hospital of Hebei Medical University) and 2 primary hospitals (Beijing Shunyi District Maternity and Child Health Hospital, and Beijing Haidian District Maternity and Child Health Hospital). The data were cumulatively collected from August 2017 to December 2019 and comprised pathological results of 1981 biopsy (n=1094) or follow-up (n=890) cases. After data cleaning, 1965 cases were included in the verification data set. The dependent variable of the machine learning model was the diagnosis result (benign or malignant) of biopsy cases with pathological biopsy classification or follow-up cases with disease classification.
The independent variables were the ultrasound features classified by the expert group and modeling working group from Peking University Cancer Hospital and Peking University People’s Hospital. This working group extracted and clarified the definitions of ultrasound imaging terminology and interpreted the ultrasound images in a blinded manner. In our previous study,24 we used the full model strategy, logistic model strategy, and random forest model strategy to screen independent variables and establish models (Table 1).

Table 1

AUC of the Two Models in Our Previous Study24

Strategies	Logistic Regression (95% CI)	Random Forest (95% CI)
Full models	0.7812 (0.7325–0.8298)	0.7878 (0.7392–0.8365)
Logistic	0.7727 (0.7227–0.8227)	0.7757 (0.7258–0.8255)
Random forest	0.7880 (0.7395–0.8364)	0.7868 (0.7377–0.8359)

The external validation data set comprised only part of the screening independent variables that need to be validated based on the previous models. The identifiable information of the boundary was classified into 4 features when the boundary was not identifiable. The specific variable assignments are shown in Table 2.

Table 2

Variable Assignment

Variables	Name	Value
Breast left/right	zyc	0-left, 1-right
Direction	FX	0- parallel, 1-unparallel
Margins blur	bqxcd1	0-identifiable, 1-non‐identifiable but no blur, 2-non‐identifiable and blurred
Margins angulation	bqxcd2	0-identifiable, 1-non‐identifiable but no angulation, 2-non‐identifiable and angled
Margins microlobulation	bqxcd3	0-identifiable, 1-non‐identifiable but no microlobulation, 2-non‐identifiable and microlobulated
Margin burr	bqxcd4	0-identifiable, 1-non‐identifiable but no burr, 2-non‐identifiable and burr
Posterior echoes	hfhs	0-no change, 1-enhanced, 2- attenuated (include mixed)
Surrounding tissue edema	shuiz	0-no, 1-yes
Benign vs malignant	End	0- benign, 1-malignant
Clinicians	biras	0-benign tendency (follow-up), 1- malignant tendency (biopsy)
Biopsy results	Path	0- benign, 1-malignant
Follow-up results	Path3	0- benign, 1-malignant

This study was approved by the Ethics Committee of Beijing Cancer Hospital (Approval Number: 2016KT14) in Beijing, China and was conducted according to the tenets of the Declaration of Helsinki. All patients provided written informed consent to participate.

Model Development

The data were divided into a modeling data set and an external verification data set. We selected 75% of the samples from the modeling data set as the training set. The variable selection, one-hot encoding, and basic model were assembled into a pipeline, which was entered into a grid search with 10-fold cross-validation. In this technique, the training set was divided into 10 folds; each fold was used in turn for internal verification, while the remaining 90% was used to train the model. Hyperparameters were tuned during this search to establish each model. We then validated the models with the remaining 25% of the samples (the test set) and with the external validation data set. Cross-validation and hyperparameter adjustments for internal validation are considered robust methods of model evaluation before external validation on a separate data set. This could maximize the potential performance of machine learning models. The discriminative capability of each model was validated using the area under the receiver operating characteristic (ROC) curve. Meanwhile, the Brier score was calculated to quantify the calibration degree of the model, and a calibration scatter diagram was created thereafter. We then evaluated the consistency between the actual observations and the model predictions by comparing the distribution of the scattered points with the reference line. The verification data were stratified into primary hospitals and tertiary class A hospitals to compare each model with the results determined by clinicians.
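The 75%/25% split and the 10-fold partition described above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the authors' code: the function names `holdout_split` and `k_fold`, the shuffling seed, and the use of plain index lists are hypothetical, and the study itself ran a scikit-learn pipeline inside a grid search.

```python
import random

def holdout_split(n_cases, train_fraction=0.75, seed=0):
    """Shuffle case indices and split them into training and test lists."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary assumption
    n_train = round(n_cases * train_fraction)
    return indices[:n_train], indices[n_train:]

def k_fold(indices, k=10):
    """Yield (fit, validate) index lists; each fold validates exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i, validate in enumerate(folds):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validate

# 1125 is the modeling data set size reported in the text.
train, test = holdout_split(1125)
splits = list(k_fold(train, k=10))
print(len(train), len(test), len(splits))
```

Each candidate hyperparameter setting would be fitted on `fit` and scored on `validate` for all ten splits, and only the setting with the best average score would then be evaluated on `test` and on the external data.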

Statistical Analysis

Raw data were cleaned using SAS v.9.4 (SAS Institute, Cary, NC), and a single factor analysis was performed. The categorical independent and dependent variables were evaluated using the chi-square test. Two-sided P values less than 0.05 were considered statistically significant. The verification process was mainly based on the “sklearn” package (version 0.22.2.post1) of Python (version 3.7.7). The model’s discriminative capability was evaluated according to the area under the curve (AUC). For a model that performs no worse than chance, the AUC ranges from 0.5 to 1, and the closer the AUC is to 1, the better the discriminative capability of the model. An AUC of 0.5 indicates that the model is not predictive and has no practical application. We evaluated the model calibration using the Brier score and calibration curve. The Brier score is the mean of (Y − p)² over all cases, where Y is the actually observed outcome variable (0 or 1), and p is the predicted probability based on the prediction model. For a useful model, the Brier score ranges from 0 to 0.25, and the smaller the score, the better the calibration of the model. A Brier score of 0.25, which is what a constant prediction of 0.5 yields, indicates that the model has no predictive capability.
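Both evaluation indices can be written out in a few lines. The sketch below is a plain-Python illustration rather than the study's scikit-learn code; the toy outcomes and probabilities are invented for the example, and the AUC is computed through its rank interpretation (the probability that a malignant case receives a higher score than a benign one, with ties counted as one half).

```python
def brier_score(y_true, p_pred):
    """Mean squared difference between outcomes (0/1) and predicted probabilities."""
    return sum((y - p) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)

def auc(y_true, scores):
    """Probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data, invented for illustration only.
y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
print(brier_score(y, p))          # about 0.158
print(auc(y, p))                  # 0.75
print(brier_score(y, [0.5] * 4))  # 0.25 for a constant prediction of 0.5
print(auc(y, [0.5] * 4))          # 0.5 for a constant prediction
```

The last two lines show where the reference values of 0.25 and 0.5 come from: a model that outputs 0.5 for every case carries no information about the outcome.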

Results

Basic Information

The modeling data set included data from 732 cases of malignant tumors (65.07%) and 393 cases of benign tumors (34.93%). Meanwhile, the validation data set included data from 498 cases of malignant tumors (25.34%) and 1467 cases of benign tumors (74.66%). With respect to clinician findings in the validation data set, clinicians assigned 1354 cases (68.91%) to follow-up (benign tendency) and 611 cases (31.09%) to biopsy (malignant tendency). Pathological examination of the biopsy cases revealed 498 malignant tumors (45.69%) and 592 benign tumors (54.31%). All follow-up cases were benign tumors (100%) on pathological examination. Comparison of the predictive variables between the modeling data set and validation data set showed a significant difference in the distribution of these predictors (P<0.001, Table 3).

Table 3

Comparison Between the Modeling Data Set and the Validation Data Set

Variables		Modeling Data Set (n=1125)	Validation Data Set (n=1965)	χ²	P
Zyc	Left, n (%)	0 (0.00%)	942 (47.94%)	–	–
	Right, n (%)	0 (0.00%)	1023 (52.06%)
FX	Parallel	826 (73.42%)	1566 (79.69%)	16.096	0.000
	Unparallel	299 (26.58%)	399 (20.31%)
Bqxcd1	Identifiable	160 (14.22%)	1074 (54.66%)	609.309	0.000
	Non‐identifiable but no blur	80 (7.11%)	240 (12.21%)
	Non‐identifiable and blurred	885 (78.67%)	651 (33.13%)
Bqxcd2	Identifiable	160 (14.22%)	1073 (54.61%)	504.371	0.000
	Non‐identifiable but no angulation	525 (46.67%)	401 (20.41%)
	Non‐identifiable and angled	440 (39.11%)	491 (24.99%)
Bqxcd3	Identifiable	160 (14.22%)	1073 (54.61%)	629.396	0.000
	Non‐identifiable but no microlobulation	363 (32.27%)	574 (29.21%)
	Non‐identifiable and microlobulated	602 (53.51%)	318 (16.18%)
Bqxcd4	Identifiable	160 (14.22%)	1074 (54.66%)	497.430	0.000
	Non‐identifiable but no burr	720 (64.00%)	717 (36.49%)
	Non‐identifiable and burr	245 (21.78%)	174 (8.85%)
hfhs	No change	687 (61.07%)	1549 (78.83%)	114.225	0.000
	Enhanced	198 (17.60%)	204 (10.38%)
	Attenuated (including mixed)	240 (21.33%)	212 (10.79%)
shuiz	No	1079 (95.91%)	1823 (92.77%)	12.326	0.000
	Yes	46 (4.09%)	142 (7.23%)
End	Benign	393 (34.93%)	1467 (74.66%)	471.132	0.000
	Malignant	732 (65.07%)	498 (25.34%)

Note: The values are presented in n (%).
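As a worked check of one row, the χ² value for FX can be recomputed from the counts in Table 3. The sketch assumes the uncorrected Pearson chi-square for a 2×2 table, which the source does not state explicitly; the function name `chi_square_2x2` is hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

# FX counts from Table 3: rows = parallel / unparallel,
# columns = modeling data set / validation data set.
chi2 = chi_square_2x2(826, 1566, 299, 399)
print(round(chi2, 3))  # 16.096
```

Under that assumption the result agrees with the 16.096 reported for FX.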

Comparison Between Benign and Malignant Tumors

Univariate analysis of the independent variables in the validation data set identified seven predictors, namely, direction, margin blur, margin angulation, margin microlobulation, margin burr, posterior echoes, and surrounding tissue edema. Further, their distribution was significantly different between the benign and malignant groups (P<0.001, Table 4). Representative ultrasound images showing malignant breast lesions are shown in Figure 1.

Table 4

Comparison Between the Benign and Malignant Groups in the Validation Set

Variables		Benign (n=1467)	Malignant (n=498)	χ²	P-value
FX	Parallel	1352 (92.16%)	214 (42.97%)	555.895	0.000
	Unparallel	115 (7.84%)	284 (57.03%)
Bqxcd1	Identifiable	1040 (70.89%)	34 (6.83%)	656.956	0.000
	Non‐identifiable but no blur	152 (10.36%)	88 (17.67%)
	Non‐identifiable and blurred	275 (18.75%)	376 (75.50%)
Bqxcd2	Identifiable	1040 (70.89%)	33 (6.63%)	657.869	0.000
	Non‐identifiable but no angulation	232 (15.81%)	169 (33.94%)
	Non‐identifiable and angled	195 (13.29%)	296 (59.44%)
Bqxcd3	Identifiable	1040 (70.89%)	33 (6.63%)	679.549	0.000
	Non‐identifiable but no microlobulation	323 (22.02%)	251 (50.40%)
	Non‐identifiable and microlobulated	104 (7.09%)	214 (42.97%)
Bqxcd4	Identifiable	1040 (70.89%)	34 (6.83%)	808.091	0.000
	Non‐identifiable but no burr	415 (28.29%)	302 (60.64%)
	Non‐identifiable and burr	12 (0.82%)	162 (32.53%)
hfhs	No change	1271 (86.64%)	278 (55.82%)	231.661	0.000
	Enhanced	116 (7.91%)	88 (17.67%)
	Attenuated (including mixed)	80 (5.45%)	132 (26.51%)
shuiz	No	1440 (98.16%)	383 (76.91%)	250.462	0.000
	Yes	27 (1.84%)	115 (23.09%)

Note: The values are presented in n (%).

Figure 1

Representative ultrasound images showing malignant breast lesions. (A) A hypoechoic malignant lesion with irregular shape, calcification (thick arrow), and non-circumscribed margin (thin arrow). (B) A hypoechoic lesion with an oval shape, circumscribed margins (thin arrow), and posterior enhancement (thick arrow). (C) A heterogeneous, hypoechoic, structurally disordered area with irregular shape and parallel orientation.

Discriminative Capability of the Machine Learning Models

The degree of discrimination was used to evaluate the discriminative and ranking capabilities of the model, which indicate the model’s capability to distinguish between individuals with and without the end-point events. In the internal verification, there were no significant differences in the results of several models after hyperparameter adjustment. The multilayer perceptron model performed best, with an AUC (95% CI) of 0.775 (0.719–0.832). In the external verification, the logistic regression model performed best after hyperparameter adjustment, with an AUC (95% CI) of 0.906 (0.892–0.921). The model performance in the verification set was generally better than that in the test set. The indicators of each model are shown in Table 5, and the ROC curves are shown in Figure 2.

Table 5

Performance Evaluation of the Different Models

Model	Accuracy	Precision Class 1	Recall Class 1	AUC of ROC	AUC of PRC	F1 Score
Test set (calibration model)
Logistic regression	0.720	0.734	0.891	0.771	0.846	0.805
Random forest	0.727	0.755	0.858	0.747	0.812	0.803
Extra trees	0.723	0.754	0.852	0.746	0.820	0.800
Support vector	0.709	0.717	0.913	0.638	0.736	0.803
Multilayer Perceptron	0.738	0.756	0.880	0.775	0.838	0.813
XG Boost	0.713	0.730	0.885	0.769	0.839	0.800
Validation set (calibration model)
Logistic regression	0.772	0.528	0.936	0.906	0.794	0.675
Random forest	0.814	0.598	0.813	0.865	0.735	0.689
Extra trees	0.813	0.597	0.807	0.855	0.709	0.687
Support vector	0.768	0.524	0.936	0.852	0.632	0.671
Multilayer Perceptron	0.818	0.596	0.869	0.901	0.792	0.708
XG Boost	0.781	0.542	0.876	0.898	0.776	0.669

Figure 2

ROC plots of the calibrated model in the test set (A) and validation set (B).

Calibration of the Machine Learning Models

Compared with discrimination, calibration pays more attention to the accuracy of the absolute risk predicted by the model, that is, the consistency between the probability of the outcome predicted by the model and the probability of the actual outcome. In the internal verification, the Brier scores of the logistic regression, random forest, extra trees, support vector, multilayer perceptron, and XGBoost models were 0.181, 0.189, 0.196, 0.199, 0.177, and 0.179, respectively. In the external verification, the Brier scores of the logistic regression, random forest, extra trees, support vector, multilayer perceptron, and XGBoost models were 0.165, 0.163, 0.170, 0.178, 0.146, and 0.161, respectively. The calibration curves are shown in Figure 3.

Figure 3

Calibration plots of the calibrated model in the test set (A) and validation set (B).

Comparison of Outcomes Between Clinician and Models

We compared the outcomes predicted by the models with those determined by clinicians, stratified by center (Table 6). Overall, clinician diagnosis showed a higher accuracy than did model diagnosis. Across the full validation set, clinician diagnosis had an accuracy of 0.906, a sensitivity of 0.928, a specificity of 0.898, and an AUC of 0.913. The accuracy of clinician diagnosis in primary hospitals was 0.905, with an AUC of 0.894; in the tertiary class A hospitals, the accuracy was 0.906 and the AUC was 0.915. When comparing clinician diagnosis between primary and tertiary class A hospitals, the sensitivity, accuracy, and AUC were higher in the tertiary class A hospitals, while the specificity was lower than that in the primary hospitals. Further, we found that each model had a better predictive performance among patients in tertiary class A hospitals than among those in primary hospitals (logistic regression model AUC: 0.915 vs 0.873; Table 7). The performance of the logistic regression model is shown in Table 8.

Table 6

Comparison Between Clinician Diagnosis and Gold Standard Diagnosis

Setting	Clinician Diagnosis	Gold Standard: Benign	Gold Standard: Malignant	Total
All validation set	Benign	1318	36	1354
	Malignant	149	462	611
	Total	1467	498	1965
Primary hospitals	Benign	535	11	546
	Malignant	54	81	135
	Total	589	92	681
Tertiary class A hospitals	Benign	783	25	808
	Malignant	95	381	476
	Total	878	406	1284
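The clinician metrics quoted in the text follow directly from the counts in Table 6. The sketch below recomputes them; the function name and dictionary layout are hypothetical. It also relies on a general property of ROC analysis: for a single benign/malignant call the ROC curve has one interior point, so its area equals the mean of sensitivity and specificity. Treating the clinician AUC this way is an assumption about how it was obtained, although it reproduces the reported values.

```python
def diagnostic_metrics(tn, fn, fp, tp):
    """Accuracy, sensitivity, and specificity from 2x2 confusion counts."""
    total = tn + fn + fp + tp
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Area under a one-point ROC curve (binary rating).
        "auc_binary": (sensitivity + specificity) / 2,
    }

# Counts from Table 6 (clinician call vs gold standard).
table6 = {
    "all": dict(tn=1318, fn=36, fp=149, tp=462),
    "primary": dict(tn=535, fn=11, fp=54, tp=81),
    "tertiary": dict(tn=783, fn=25, fp=95, tp=381),
}
results = {name: diagnostic_metrics(**counts) for name, counts in table6.items()}
for name, metrics in results.items():
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

For the full validation set this gives an accuracy of about 0.906, a sensitivity of about 0.928, a specificity of about 0.898, and a binary AUC of about 0.913.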

Table 7

Comparison Between Clinician and Model Diagnosis

Model	Accuracy	Precision Class 1	Recall Class 1	AUC of ROC	AUC of PRC	F1 Score	Threshold	FPR	TPR
Full validation set
Clinicians	0.906	0.756	0.927	0.913	0.851	0.833	–	–	–
Logistic regression	0.772	0.528	0.936	0.906	0.794	0.675	0.571	0.181	0.829
Random Forest	0.814	0.598	0.813	0.865	0.735	0.689	0.491	0.185	0.815
Extra Trees	0.813	0.597	0.807	0.855	0.709	0.687	0.505	0.185	0.807
Support vector	0.768	0.524	0.936	0.852	0.632	0.671	0.710	0.206	0.793
Multilayer perceptron	0.818	0.596	0.869	0.901	0.792	0.708	0.573	0.187	0.827
XG Boost	0.781	0.542	0.876	0.898	0.776	0.669	0.557	0.183	0.817
Tertiary class A hospitals
Clinicians	0.906	0.790	0.932	0.915	0.874	0.855	–	–	–
Logistic regression	0.798	0.618	0.941	0.915	0.839	0.746	0.584	0.155	0.833
Random forest	0.798	0.641	0.825	0.861	0.778	0.721	0.565	0.198	0.788
Extra trees	0.795	0.638	0.813	0.850	0.750	0.715	0.548	0.213	0.796
Support vector	0.793	0.612	0.941	0.851	0.687	0.742	0.712	0.210	0.791
Multilayer perceptron	0.807	0.643	0.877	0.903	0.829	0.742	0.573	0.210	0.837
XG Boost	0.792	0.621	0.877	0.900	0.816	0.727	0.581	0.208	0.828
Primary hospitals
Clinicians	0.905	0.683	0.918	0.894	0.807	0.784	–	–	–
Logistic regression	0.797	0.388	0.870	0.873	0.544	0.537	0.584	0.199	0.783
Random forest	0.747	0.321	0.783	0.771	0.446	0.456	0.627	0.246	0.739
Extra trees	0.746	0.318	0.772	0.766	0.409	0.451	0.644	0.251	0.750
Support vector	0.717	0.314	0.924	0.797	0.304	0.468	0.696	0.246	0.750
Multilayer perceptron	0.749	0.329	0.826	0.860	0.578	0.471	0.715	0.248	0.750
XG Boost	0.725	0.309	0.837	0.836	0.481	0.452	0.587	0.243	0.750

Table 8

Performance of the Logistic Regression Model

	B	SE	OR	95% CI	P	β
fx	1.454	0.165	4.281	3.098–5.917	<0.001	0.322239
bqxcd1	0.235	0.143	1.265	0.956–1.674	0.100	0.118155
bqxcd2	0.334	0.142	1.396	1.058–1.844	0.019	0.155041
bqxcd3	0.716	0.154	2.047	1.513–2.768	<0.001	0.295653
bqxcd4	1.184	0.247	3.267	2.013–5.303	<0.001	0.425586
hfhs	0.340	0.101	1.405	1.152–1.714	0.001	0.123337
shuiz	1.193	0.269	3.298	1.947–5.586	<0.001	0.170345
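The OR and 95% CI columns of Table 8 can be recovered from the B and SE columns, since OR = exp(B) and the Wald interval is exp(B ± 1.96 × SE). The sketch below does this for two rows; the function name is hypothetical, and small differences from the table are expected because B and SE are themselves rounded to three decimals.

```python
import math

def odds_ratio_ci(b, se, z=1.96):
    """Odds ratio and Wald 95% CI from a logistic coefficient and its SE."""
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)

# (B, SE) pairs copied from Table 8.
coefficients = {"fx": (1.454, 0.165), "bqxcd4": (1.184, 0.247)}
for name, (b, se) in coefficients.items():
    odds_ratio, low, high = odds_ratio_ci(b, se)
    print(f"{name}: OR={odds_ratio:.3f}, 95% CI {low:.3f}-{high:.3f}")
```

An OR of about 4.28 for fx means that, holding the other features fixed, non-parallel orientation multiplies the odds of malignancy roughly fourfold.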

Model Risk Probability Distribution

Our models enabled the prediction of BC and can thus be used by clinicians to make appropriate patient management decisions. As shown in Figure 4, the predictive capability of the models ranged from 0.2 to 0.4. We summarized the model prediction probabilities at the 1st, 2nd, 5th, 10th, 50th, 90th, 95th, 98th, and 99th percentiles of the population and applied the logistic model in the clinic for preliminary evaluation of BC (Table 9).

Figure 4

Probability distribution by model.

Table 9

Predicted Probability of Different Proportions of People by Model

	Logistic Regression	Random Forest	Extra Trees	Support Vector	Multilayer Perceptron	XG Boost
1%	0.2158926	0.0870467	0	0.2690289	0.1223317	0.1271728
2%	0.2481656	0.2063348	0.1830000	0.2690872	0.1924786	0.2399745
5%	0.2953400	0.2432472	0.2500000	0.2691355	0.2032477	0.2851146
10%	0.2953400	0.2826738	0.2857143	0.2691355	0.2580033	0.2999176
50%	0.2953400	0.2826738	0.2857143	0.2691355	0.2580033	0.2999176
90%	0.8769365	0.8999733	0.9291429	0.7422661	0.8754747	0.8494976
95%	0.9327307	0.9831579	1	0.7428197	0.9669854	0.9255747
98%	0.9648594	1	1	0.7554798	0.9834885	0.9730366
99%	0.9675776	1	1	0.7882681	0.9877369	0.9751260

Discussion

Breast Cancer Screening Deficits

The increasing incidence of BC, which is primarily related to overdiagnosis and treatment, and the possibility of cancer omissions indicate the need for changes in BC screening procedures. Harkness’s review provides a detailed overview of risk-based BC screening strategies for women.25 Most cancer screening strategies primarily use mammography. However, its sensitivity in women with dense breast tissue is only 47.8–64.4%,26 limiting its benefit in this population. ABUS examination is an important screening method due to its safety and relatively low cost, especially in women with dense breast tissue. However, it is limited by its reliance on operators and high recall rates. High-level evidence on supplemental ultrasound is currently scarce.27 In a previous population-based cancer screening program in China, the overall proportion of positive ultrasound examinations was only 13.51% for high-risk women with BC.28

Advantages of Our Study

Many studies have reported advances in BC prediction models.29–33 However, previous predictive models based on the features of conventional ultrasound images of breast tumors provided limited value due to the small sample size used for modeling and lack of external verification.34–36 To the best of our knowledge, this is the first large-sample, multi-center, externally validated predictive model study that focuses on the use of ultrasound image features for BC screening.

Predictors of Breast Cancer

Based on our previous study that initially identified 27 independent variables,24 we selected 7 independent variables to develop six machine learning models for BC diagnosis. In our logistic regression model, tumor margin burr and the direction of tumor growth had a relatively profound impact on the differentiation between benign and malignant tumors. The odds ratios (ORs) were 3.267 (2.013–5.303) and 4.281 (3.098–5.917), respectively (Table 8). This is consistent with the findings reported by Chhatwal et al37 that the most important predictors associated with BC as identified by their model were spiculated mass margins. Direction of tumor growth, non-identifiable and burr at the margins, and edema of the surrounding tissue showed the highest OR values. This indicated that non-parallel growth, non-identifiable margin burr, and edema of the surrounding tissue are the most important factors for predicting malignant BC. Wang et al also showed that axillary lymphadenopathy is indicative of the probability of metastasis in BC.38

Performance of the Predictive Models Compared to Those in Previous Studies

The average AUCs of the models in the test and validation sets were 0.741±0.052 and 0.880±0.025, respectively. At a threshold of 0.571, the logistic model achieved 82.9% sensitivity and 81.9% specificity in the validation set. The overall performance of the model in the validation set was better than that in the test set. Compared with internal verification, external verification is more concerned with model transportability and generalizability. Thus, we believe that the predictive model can be applied generally across population samples and has good promotion significance. Guo et al used 4 ultrasound image features to develop a logistic model of BC recurrence risk, with an AUC of 0.801.34 Gao et al conducted a multi-center study in China that combined the Gail model and the Breast Imaging Reporting and Data System (BI-RADS) category to differentiate malignant and benign breast lesions. The results showed that their combination achieved higher accuracy than did each model alone.39 When compared with clinician diagnosis, the logistic regression model showed lower accuracy (0.906 vs 0.772) and AUC (0.913 vs 0.906). When model performance was evaluated by type of hospital (tertiary class A hospitals and primary hospitals), the model performed better in tertiary class A hospitals than it did in primary hospitals (Table 7). This may be due to the different distribution of benign and malignant tumors in both groups. The proportion of benign tumor patients was significantly higher in primary hospitals (n=892, 85.93%) than that in tertiary class A hospitals (n=575, 62.02%). For complex malignant tumors, predictions based on models alone are more likely to be biased. In primary hospitals, the accuracy of clinician diagnosis was higher than that of the logistic model (0.929 vs 0.806), and the AUC of clinician diagnosis was also slightly higher (0.913 vs 0.906).
Similarly, the accuracy of clinician diagnosis in tertiary class A hospitals was higher than that of the logistic model (0.880 vs 0.734). The AUC of clinician diagnosis was also slightly higher than that of the logistic model (0.890 vs 0.875). The high sensitivity of clinician diagnosis in tertiary class A hospitals indicates that clinicians have a greater probability of accurately diagnosing malignant tumors, and the possibility of missed diagnosis is lower. Meanwhile, the high specificity of clinician diagnosis in primary hospitals indicates that clinicians in these hospitals can accurately diagnose benign tumors, and the possibility of misdiagnosis is lower.

Difference in AUC and Accuracy According to Model Performance Indicators

Although there was no significant difference in AUC between the model and clinician diagnosis, the accuracy seems markedly different. The imbalance between benign and malignant cases in the external validation set is an important reason for the low accuracy. The external validation set of 1965 cases included 498 cases of malignant tumors and 1467 cases of benign tumors. This means that by simply classifying all cases as benign, we can already achieve good accuracy: 1467/(1467 + 498) = 74.7%. The 77.2% accuracy of the logistic model was calculated at the default threshold of 0.5 in the validation set. When the threshold was 0.571, the logistic model achieved 82.1% accuracy. Thus, we cannot compare the accuracy (performance at one threshold) with the AUC (a summary of performance over all possible thresholds). Scoring rules such as the proportion classified correctly, sensitivity, and specificity are improper and also depend on an arbitrary choice of threshold. A proper scoring rule (the Brier score) and the c-index (the area under the ROC curve, or concordance probability, a semi-proper scoring rule) give a more reliable assessment. The ROC curve plots sensitivity against 1 − specificity for every possible threshold, and the AUC is the area under that curve; it equals the probability that the model assigns a higher predicted risk to a randomly chosen malignant case than to a randomly chosen benign case.
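The dependence of accuracy on case mix and threshold can be made concrete with the figures already reported. The sketch below is illustrative only: the function name is hypothetical, and the sensitivity and specificity at the 0.571 threshold are taken from the rounded TPR and FPR in Table 7, so the result matches the reported accuracy only up to rounding.

```python
N_MALIGNANT, N_BENIGN = 498, 1467  # external validation set

def accuracy(sensitivity, specificity):
    """Accuracy implied by sensitivity and specificity at the observed case mix."""
    correct = sensitivity * N_MALIGNANT + specificity * N_BENIGN
    return correct / (N_MALIGNANT + N_BENIGN)

# Trivial rule: call every case benign.
baseline = accuracy(sensitivity=0.0, specificity=1.0)
# Logistic model at threshold 0.571 (TPR 0.829, FPR 0.181 from Table 7).
at_0571 = accuracy(sensitivity=0.829, specificity=1 - 0.181)
print(f"{baseline:.3f} {at_0571:.3f}")  # about 0.747 and 0.822
```

Because roughly three quarters of the cases are benign, accuracy is dominated by specificity, which is why a rule with zero sensitivity still scores 74.7%.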

Explanation of Logistic Model Performance

Model performance was evaluated according to the AUC. Therefore, increasing the number of samples to obtain a more balanced data set may help improve the accuracy of the model. However, it would contribute little to improving the AUC. In addition, the distribution of benign and malignant samples is imbalanced in the real world owing to factors such as tumor prevalence. External validation of the model enables evaluation under conditions closer to the real world and thus determines its generalizability. Therefore, we chose not to use a more balanced data set for external verification. Because the logistic model is a shallow learning method, a deep learning model with better optimization capabilities for imbalanced categories might surpass it in prediction accuracy to a certain extent. Finally, the diagnostic process involves several types of data, not only ultrasound images. Our model uses only very limited ultrasound features. Therefore, in theory, the model cannot achieve the high diagnostic efficiency of physicians from tertiary medical centers. However, the AUCs support a similar diagnostic accuracy of our model to that of physician diagnosis, and thus it can be used to distinguish between benign and malignant tumors.

Limitations

This study has some limitations. First, this study was mainly an external validation of the previous model. The distribution of the independent variables in the modeling population differs from that in the validation population, which may cause a selection bias. Second, this study did not modify or improve the model to address the imbalance in the distribution of the predictor variables and classes, and thus the model has low accuracy. When constructing predictive models, future research should consider more complex models, such as deep learning, that can compensate for imbalanced sample proportions. Third, this study did not collect demographic information or baseline patient data, so it was difficult to balance the patient baseline in the pre-modeling stage. This may have affected the performance of the model and introduced confounding factors. Future research can consider adding characteristic variables such as demographics or building a compound model to improve predictive performance.

Conclusion

Of the six machine learning models, the logistic regression model showed the highest AUC and generalizability, indicating its potential for application in primary hospitals. Compared with clinician diagnosis, the logistic model showed better diagnostic efficiency, supporting its potential for application in BC screening in lower level medical centers.

Expert Recommendations

If the predicted probability from our logistic model is below the 1st percentile of the population (corresponding to a predicted probability of 0.2158926), the patient most likely does not need to undergo pathological biopsy: malignancy can be largely ruled out, and the patient can undergo regular follow-up. When the predicted probability is above the 90th percentile of the population (corresponding to a predicted probability of 0.8769365), it is highly indicative of a malignant lesion, and clinicians are required to intervene; the patient should immediately undergo pathological biopsy to confirm malignancy. For patients whose predicted probabilities lie between these values, short-term follow-up (within 1 year, preferably 3 to 6 months) can be recommended.⁴⁰ Clinicians can further use the model to assist in decision-making according to the follow-up outcomes. However, the cut-off values of the predicted probability need to be verified and calculated in studies with a larger sample size.
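The three-way recommendation can be written as a small decision rule. The sketch below is illustrative only: the two cut-offs are the values reported above, while the function name `recommend`, the returned labels, and the choice to place a probability exactly equal to a cut-off in the middle band are our assumptions.

```python
# Cut-offs on the predicted probability, as reported in the text.
LOW_CUTOFF = 0.2158926   # below this: malignancy largely ruled out
HIGH_CUTOFF = 0.8769365  # above this: highly indicative of malignancy


def recommend(predicted_probability):
    """Map a predicted probability of malignancy to one of three actions.

    Assumption: a value exactly equal to a cut-off is treated as lying
    "in between" and falls into the short-term follow-up band.
    """
    if not 0.0 <= predicted_probability <= 1.0:
        raise ValueError("predicted probability must lie in [0, 1]")
    if predicted_probability < LOW_CUTOFF:
        return "regular follow-up"
    if predicted_probability > HIGH_CUTOFF:
        return "immediate pathological biopsy"
    return "short-term follow-up (within 1 year, preferably 3 to 6 months)"


for p in (0.05, 0.50, 0.95):
    print(f"p = {p:.2f}: {recommend(p)}")
```

Keeping the cut-offs as named constants makes them easy to replace once they have been re-estimated in a larger sample.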
