
Prediction of Breast Cancer using Machine Learning Approaches.

Reza Rabiei¹, Seyed Mohammad Ayyoubzadeh², Solmaz Sohrabei³, Marzieh Esmaeili², Alireza Atashi⁴.

Abstract

Background: Breast cancer is considered one of the most common cancers in women and is caused by various clinical, lifestyle, social, and economic factors. Machine learning has the potential to predict breast cancer based on features hidden in data. Objective: This study aimed to predict breast cancer using different machine-learning approaches applied to demographic, laboratory, and mammographic data.
Material and Methods: In this analytical study, a database of 5,178 independent records, 25% of which belonged to breast cancer patients, with 24 attributes in each record, was obtained from Motamed cancer institute (ACECR), Tehran, Iran. The random forest (RF), neural network (MLP), gradient boosting trees (GBT), and genetic algorithms (GA) were used in this study. Models were initially trained with demographic and laboratory features (20 features). The models were then trained with all demographic, laboratory, and mammographic features (24 features) to measure the effectiveness of mammography features in predicting breast cancer.
Results: RF presented higher performance compared to other techniques (accuracy 80%, sensitivity 95%, specificity 80%, and the area under the curve (AUC) 0.56). Gradient boosting (AUC=0.59) showed a stronger performance compared to the neural network.
Conclusion: Combining multiple risk factors in modeling for breast cancer prediction could help the early diagnosis of the disease with necessary care plans. Collection, storage, and management of different data and intelligent systems based on multiple factors for predicting breast cancer are effective in disease management. Copyright: © Journal of Biomedical Physics and Engineering.


Keywords: Artificial Intelligence; Breast Cancer; Computing Methodologies; Genetic Algorithm; Machine Learning

Year: 2022 PMID: 35698545 PMCID: PMC9175124 DOI: 10.31661/jbpe.v0i0.2109-1403

Source DB: PubMed Journal: J Biomed Phys Eng ISSN: 2251-7200

Introduction

Breast cancer is considered a multifactorial disease and the most common cancer in women worldwide [ 1 , 2 ], accounting for approximately 30% of all female cancers [ 3 , 4 ] (i.e. 1.5 million women are diagnosed with breast cancer each year, and 500,000 women die from this disease worldwide). Over the past 30 years, the incidence of this disease has increased, while the death rate has decreased. The reduction in mortality attributed to mammography screening is estimated at 20%, and that attributed to improvement in cancer treatment at 60% [ 5 , 6 ]. Diagnostic mammography can assess abnormal breast tissue in patients with subtle and inconspicuous signs of malignancy. However, because of the large number of images involved, this method cannot be used effectively to assess areas suspected of cancer. According to a report, approximately 50% of breast cancers were not detected in screenings of women with very dense breast tissue [ 7 ]. In addition, about a quarter of women with breast cancer are diagnosed within two years of a negative screening result. Therefore, the early and timely diagnosis of breast cancer is crucial [ 8 ]. Most mammography-based breast cancer screening is performed at regular intervals, usually annually or every two years, for all women. Such a fixed, one-size-fits-all screening program is not effective in diagnosing cancer at the individual level and may impair the effectiveness of screening programs [ 9 ]. On the other hand, experts suggest that considering other risk factors along with mammography screening can support a more accurate diagnosis of women at risk [ 9 - 11 ]. Moreover, effective risk prediction through modeling can help radiologists not only to set up personalized screening for patients and encourage them to participate in early detection programs, but also to identify high-risk patients [ 12 , 13 ].
Machine learning, as a modeling approach, is the process of extracting knowledge from data and discovering hidden relationships [ 14 ], and it has been widely used in healthcare in recent years [ 15 ] to predict different diseases [ 16 - 18 ]. Some studies used only demographic risk factors (lifestyle and laboratory data) in predicting breast cancer [ 19 , 20 ], and several studies based their predictions on mammographic images [ 21 ] or used data from patient biopsy [ 22 ]. Others showed the application of genetic data in predicting breast cancer [ 23 ]. A major challenge in predicting breast cancer is the creation of a model that addresses all known risk factors [ 24 - 26 ]. Current prediction models might focus only on the analysis of mammographic images or demographic risk factors and leave out other critical factors. In addition, these models, which are accurate enough for identifying high-risk women, could result in multiple screening and invasive sampling with magnetic resonance imaging (MRI) and ultrasound, which can impose a financial and psychological burden on patients [ 27 - 29 ]. The effective prediction of breast cancer risk requires different factors, including demographic, laboratory, and mammographic risk factors [ 24 , 25 , 30 , 31 ]. Therefore, multifactorial models that include many risk factors in their analysis can be effective in assessing the risk of breast cancer more accurately [ 32 , 33 ]. The current study aimed to predict breast cancer using different machine learning approaches considering various factors in modeling.

Material and Methods

In this analytical study, the database was obtained from a clinical breast cancer research center (Motamed cancer institute) in Tehran, Iran. The research was conducted in 4 stages: data collection, data pre-processing, modeling, and model evaluation.

Data Collection

In the first stage, 5178 records of people, referred to the research center over the past 10 years (2011-2021), were prepared retrospectively. Each record covered 24 features (11 demographic features, 9 laboratory features, and 4 mammography features) (Table 1), all labeled to indicate the presence or absence of breast cancer, of which 1,295 records (25%) were identified as breast cancer.

Table 1

The relevant features of breast cancer

Feature name	Description	Type	Values
Age	age at diagnosis	Demographic	<100 Years
Age.menop	age of menopause	Demographic	38-65 Years
First pregnancy	age at first pregnancy	Demographic	13-42 Years
Age.menarch	age of menarche	Demographic	11-18 Years
BMI	Body mass index	Demographic	Underweight (Below 18.5) =0, Normal (18.5 - 24.9) =1, Overweight (25.0 - 29.9) =2, Obese (30.0 and Above) =3
Lactation	Breastfeeding status	Demographic	0-96 months
Physical Activity	Have a regular Physical Activity	Demographic	Yes=1 No=0
Education	Academic education	Demographic	Illiterate=1, primary=2, high school=3, university=4
Life event stress	life event status	Demographic	No=0, death of father=1, family problems=2, death of mother=3, death of child=4, death of husband=5, divorced=6
Smoking	Smoking status	Demographic	Yes=1, No=0
Marital	marital status	Demographic	Single=0 other=1
Duration Ocp.used	Months of oral contraceptive pill (OCP) use	Laboratory	0-120 months
Duration HRT used	Months of hormone replacement therapy (HRT) use	Laboratory	0-120 months
Personal. Other. Cancer	Personal history of other cancers	Laboratory	No=0, ovary=1, endometrium=2, colon=3, meningioma=4, lymphoma=5
Family.BC	Family history of breast cancer	Laboratory	Yes=1 No=0
Exposure X-ray	Exposure X-ray to chest	Laboratory	Negative=0 positive=1
Vitamin D3	Amount of vitamin D in body	Laboratory	<10 mg (deficiency)=0, 10-30 mg (insufficiency)=1, 30-100 mg (sufficient)=2, >100 mg (overdose)=3
Biopsy	pathology of biopsy	Laboratory	no malignancy detected=0, lobular carcinoma in situ=1, ductal carcinoma in situ=2, ductal carcinoma in situ=3, invasive lobular carcinoma=4, medullary=5, microinvasion=6
Hysterectomy	history of hysterectomy	Laboratory	Yes=1 No=0
Personal.BC	Personal Breast Cancer history	Laboratory	Yes=1 No=0, surgery=2, RT (Radio Therapy) =3
Breast density	screening	Mammography	Fatty tissue=0, glandular and fibrous tissue=1, dense =2, heterogeneously dense extremely dense=3
Micro lobulated	screening	Mammography	None=0, Fibroadenoma=1, Papilloma=2, Phyllodes tumor=3, DCIS=4, IDC=5, ILC=6, Lactating and tubular adenomas =7
Circumscribed	screening	Mammography	None=0 cysts=1, complicated cyst=2, clustered microcyst=3, solid mass=4
Micro calcification, Macro calcification	screening	Mammography	Probably benign (punctate)=1, intermediate concern (coarse heterogeneous, amorphous)=2, higher probability of malignancy (fine pleomorphic, fine linear/branching)=3
Class	Breast Cancer		malignant=1 benign=0

DCIS: Ductal carcinoma in situ, IDC: Invasive ductal carcinoma, ILC: Invasive lobular carcinoma


Data preprocessing

The second step was data preprocessing, in which five records related to men were removed, and a total of 1290 records remained. Laboratory values that fell outside the expected range were corrected against the central registry, where the patients’ laboratory results were available. In addition, missing values were replaced with the most frequent value of the corresponding feature (the mode). Finally, the Synthetic Minority Oversampling Technique (SMOTE) was used to balance the training data because the two study classes differed in their number of records.
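The two preprocessing operations named above can be illustrated with a minimal sketch. It is not the authors' code: the function names `impute_mode` and `smote_like`, the toy records, and the choice of `k` are assumptions made for illustration, and the oversampler only mimics the core idea of SMOTE (interpolating between a minority sample and one of its nearest minority neighbours).

```python
import random
from collections import Counter


def impute_mode(rows):
    """Replace None in each column with that column's most frequent value."""
    n_cols = len(rows[0])
    modes = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        modes.append(Counter(observed).most_common(1)[0][0])
    return [[modes[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each lying on the segment between a
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic


# Hypothetical toy records: [age, BMI category]; None marks a missing value
records = [[45, 2], [52, None], [None, 2], [45, 1], [60, 2]]
filled = impute_mode(records)

# Hypothetical minority-class samples (training data only)
minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 2.0]]
new_points = smote_like(minority, n_new=4)
```

Oversampling is applied to the training portion only, as stated in the text, so that synthetic records do not leak into the test set.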

Modeling for breast cancer prediction

In the third step, the open-source Python programming language was used for modeling, together with the Scikit-Learn 0.18.2, NumPy v1.20, and TPOT libraries. Three learners, i.e. Random forest (RF), Gradient Boosting trees (GBT), and Multi-layer Perceptron (MLP), were applied to the dataset. In addition, K-fold (K=3) validation was used to obtain the optimized hyper-parameters of each model in the genetic algorithm step. In the final evaluation, the train-test split method (75% for training and 25% for testing) was used to estimate the performance of the models more accurately. A genetic algorithm (GA) with a population of 5, 50 offspring, and 10 generations, using the highest accuracy as the model selection criterion, was used to optimize the values of these hyper-parameters. The models were first trained with demographic and laboratory features (20 features). They were then trained with all demographic, laboratory, and mammography features (24 features) to measure the effect of mammography features in predicting breast cancer. For the MLP, the number of hidden layers was set to 10, and the alpha value for the training rate ranged from 0.01 to 0.2. The sigmoid and hyperbolic tangent functions were selected as activation functions. The solver was set to a gradient-based optimizer, such as Adam or Stochastic Gradient Descent (SGD), to find the optimal weights. In the GBT model, the learning rate ranged from 0.01 to 0.2, and the maximum depth was set to 3, 5, and 8. The buoyancy level learning was 0.1 and the estimator value for the gradient boosting was 10. In the random forest (RF) model, the minimum number of samples required at a leaf (external) node was set to 4 and 12. The estimator value was 151, and the minimum number of samples required to split a node (min_samples_split) was set to 5 and 10. The block diagram for the methods is shown in Figure 1.
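The validation scheme described above (a 75%/25% train-test split, with 3-fold validation inside the training portion) can be sketched as follows. This is an illustrative sketch only: the helper names are hypothetical, and shuffling before the split and the fixed seed are assumptions, since the text does not state how records were assigned.

```python
import random


def train_test_split_indices(n, test_fraction=0.25, seed=0):
    """Shuffle indices 0..n-1 and hold out test_fraction of them."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]


def k_fold_indices(indices, k=3):
    """Yield (train, validation) index lists for k roughly equal folds."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


# Hypothetical dataset size, chosen only to keep the arithmetic simple
train_idx, test_idx = train_test_split_indices(1000)
folds = list(k_fold_indices(train_idx, k=3))
```

In this arrangement the folds are used only to compare hyper-parameter candidates, while the held-out 25% is touched once, for the final performance estimate.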

Figure 1

Block diagram of methods

Random Forest (RF)

As a non-parametric classification approach, the RF builds a large number of decision trees and classifies data at high speed [ 34 ]. Each tree is given a random subset of the input variables, and all the trees are then combined for a better inference from the variables [ 35 ].
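The idea of combining many randomized trees can be illustrated with a deliberately small sketch. It assumes depth-1 trees (stumps), one random feature per tree, bootstrap sampling of records, and majority voting; these simplifications and all function names are illustrative and do not reproduce the scikit-learn implementation used in the study.

```python
import random
from collections import Counter


def fit_stump(X, y, feature):
    """Best single-threshold rule on one feature (a depth-1 'tree')."""
    best = None
    for t in sorted({row[feature] for row in X}):
        for left, right in ((0, 1), (1, 0)):
            preds = [left if row[feature] <= t else right for row in X]
            errors = sum(p != label for p, label in zip(preds, y))
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    _, t, left, right = best
    return feature, t, left, right


def predict_stump(stump, row):
    feature, t, left, right = stump
    return left if row[feature] <= t else right


def fit_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(len(X)) for _ in X]  # bootstrap the records
        feature = rng.randrange(len(X[0]))           # random input variable
        forest.append(fit_stump([X[i] for i in sample],
                                [y[i] for i in sample], feature))
    return forest


def predict_forest(forest, row):
    """Majority vote over all trees."""
    votes = Counter(predict_stump(s, row) for s in forest)
    return votes.most_common(1)[0][0]


# Hypothetical, well-separated toy data (two features, binary class)
X = [[1, 2], [2, 1], [3, 4], [4, 3], [7, 8], [8, 7], [9, 10], [10, 9]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(X, y)
```

Individual stumps trained on an unlucky bootstrap sample can be wrong; the vote across trees is what makes the combined prediction stable.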

Gradient Boosting Trees (GBT)

This algorithm is one of the gradient boosting algorithms and performs very well in classification [ 36 ]. In this method, the trees are trained one after another; each subsequent tree is trained primarily on the data that the previous tree predicted erroneously. This process continuously reduces the model error since each model is sequentially improved against the weaknesses of the previous model [ 37 , 38 ].
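The sequential error correction described above can be shown with a minimal sketch. It assumes a single feature, regression stumps, and squared-error loss, so each new stump is fitted to the residuals of the current ensemble; classification boosting in libraries such as scikit-learn uses a different loss, and the function names and toy data here are hypothetical.

```python
def fit_regression_stump(x, residuals):
    """Threshold on a single feature that minimises squared error."""
    best = None
    for t in sorted(set(x))[:-1]:  # keep at least one point on the right
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]


def boost(x, y, n_rounds=10, learning_rate=0.1):
    """Fit stumps one after another, each on the current residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps, history = [], []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        t, lm, rm = fit_regression_stump(x, residuals)
        stumps.append((t, lm, rm))
        preds = [p + learning_rate * (lm if xi <= t else rm)
                 for xi, p in zip(x, preds)]
        mse = sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)
        history.append(mse)  # training error after this round
    return base, stumps, history


# Hypothetical toy data: one feature, binary target
base, stumps, history = boost([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1])
```

A smaller learning rate shrinks each correction, which is why it is usually tuned together with the number of estimators.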

Multi-Layer Perceptron (MLP)

As a deep artificial neural network, the MLP is composed of an input layer that receives the signal, an output layer used for prediction, and, between those two, hidden layers that act as the computation engine. The MLP belongs to the supervised networks and is trained by the backpropagation algorithm. In this network, data flow from the input nodes to the output nodes. If there is an error in the output, it is propagated back from the output toward the input, and the weights are corrected accordingly; backpropagation is the most commonly used method for this [ 39 , 40 ].

Genetic Algorithm (GA)

As a subset of evolutionary computing, GA is directly associated with artificial intelligence and is used for solving optimization problems through an evolutionary process [ 41 , 42 ]. To obtain the best answer, the GA applies the survival-of-the-fittest rule to a set of candidate solutions in order to find the best solution to the problem [ 43 , 44 ]. In each generation, following a natural biological process, the best chromosomes are selected to create the subsequent generation, so that the problem is solved optimally [ 45 ].
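A minimal sketch of this evolutionary loop is given below, using the population size (5), number of offspring (50), and number of generations (10) reported in the modeling step. Everything else is an assumption for illustration: the `evolve` function, uniform crossover, the mutation rate, elitist selection, and in particular `toy_fitness`, which merely stands in for the cross-validated accuracy used as the selection criterion in the study.

```python
import random


def evolve(fitness, gene_choices, population_size=5, offspring_size=50,
           generations=10, mutation_rate=0.2, seed=0):
    rng = random.Random(seed)
    population = [[rng.choice(choices) for choices in gene_choices]
                  for _ in range(population_size)]
    best_per_generation = []
    for _ in range(generations):
        offspring = []
        for _ in range(offspring_size):
            mother, father = rng.choice(population), rng.choice(population)
            # Uniform crossover: each gene comes from one parent at random
            child = [rng.choice(pair) for pair in zip(mother, father)]
            # Mutation: occasionally resample a gene from its allowed values
            child = [rng.choice(choices) if rng.random() < mutation_rate else g
                     for g, choices in zip(child, gene_choices)]
            offspring.append(child)
        # Elitist selection: parents compete with their offspring
        population = sorted(population + offspring, key=fitness,
                            reverse=True)[:population_size]
        best_per_generation.append(fitness(population[0]))
    return population[0], best_per_generation


def toy_fitness(individual):
    """Hypothetical stand-in for cross-validated accuracy."""
    learning_rate, max_depth = individual
    return -abs(learning_rate - 0.1) - abs(max_depth - 5) / 10


# Chromosome = [learning_rate, max_depth]; candidate values are illustrative
choices = [[0.01, 0.05, 0.1, 0.2], [3, 5, 8]]
best, history = evolve(toy_fitness, choices)
```

Because the best individuals of each generation are carried over, the best fitness never decreases from one generation to the next.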

Model Evaluation

The test results of the database samples (confusion matrix) are shown in Table 2. In the final stage, the performance of the created models was measured by different criteria. The classification of samples is one of the common criteria in evaluating and measuring the ability of classifiers, the degree of separation or accuracy, and the separation of classes [ 46 ]. In this study, accuracy, sensitivity, specificity, and the area under the receiver operating characteristic (ROC) curve were used to measure the overall performance of the classifiers.

Table 2

Confusion matrix of a binominal classifier

		Predicted
		Negative	Positive
Actual	Negative	TN	FP
Actual	Positive	FN	TP

TN: True Negative, FN: False Negative, FP: False Positive, TP: True Positive
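The three threshold-based criteria follow directly from the four cells of the confusion matrix. The sketch below uses the standard definitions; the function name and the example counts are hypothetical and are not taken from the study's results.

```python
def classification_metrics(tn, fp, fn, tp):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,    # all correct / all records
        "sensitivity": tp / (tp + fn),    # true positive rate
        "specificity": tn / (tn + fp),    # true negative rate
    }


# Hypothetical counts for illustration
metrics = classification_metrics(tn=80, fp=20, fn=5, tp=95)
```

With these counts, 95 of 100 actual positives and 80 of 100 actual negatives are classified correctly.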


Results

A total of 1290 records containing 24 demographic, laboratory, and mammographic features related to breast cancer were used in the study; the weights of the features, based on their degree of importance and ranging from 0.0 to 1, are shown in Figure 2. Family history of breast cancer, personal history of breast cancer, breast density, and age at diagnosis were among the most important factors in the diagnosis of this disease.

Figure 2

The weight of the features in breast cancer prediction

Based on the area under the ROC curve, Gradient Boosting Trees (GBT) was the model with the highest performance. The modeling results using RF, GBT, and MLP are shown in Table 3, and their ROC curves are compared in Figure 3 and Table 4.

Table 3

Performance comparison of the breast cancer prediction models

Models	Features	AUC	Sensitivity (%)	Specificity (%)	Accuracy (%)
Random Forest	Demographics	0.53	93	83	79
Random Forest	Demographics + Mammography	0.53	95	83	80
Gradient Boosting	Demographics	0.59	63	87	62
Gradient Boosting	Demographics + Mammography	0.59	82	86	74
Multi-Layer Perceptron	Demographics	0.56	78	85	71
Multi-Layer Perceptron	Demographics + Mammography	0.56	82	84	73

AUC: Area under the ROC curve, ROC: Receiver operating characteristic

Figure 3

Receiver operating characteristic (ROC) curve of models

Table 4

Area under the Receiver operating characteristic (ROC) curve

Test Result Model(s)	Area
GBT	0.59
MLP	0.56
RF	0.53

GBT: Gradient Boosting Tree, MLP: Multi-Layer-Perceptron, RF: Random Forest
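For reference, the area under the ROC curve can be computed without drawing the curve: it equals the proportion of (positive, negative) pairs in which the positive record receives the higher score, with ties counted as one half. The sketch below uses this pairwise formulation; the function name and the example scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted scores
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Under this definition, a value of 1.0 means every positive is ranked above every negative, and 0.5 corresponds to a ranking no better than chance.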


Discussion

According to the findings of the current study, the mammographic features along with other features could improve the performance of models. The RF model showed the highest sensitivity (95%); however, given the sensitivity of breast cancer diagnosis, models such as gradient boosting, with higher specificity (86%), were more efficient. In a study by Rosner et al. [ 47 , 48 ], family and personal history of breast cancer were two of the key influential factors in breast cancer, which is consistent with the findings of the current study, as these two factors demonstrated the highest weights (0.92 and 0.89) compared to other factors. Breast density and age are influential in tumor appearance and increase the proportion of breast cancers [ 49 ], with weights of 0.80 and 0.80, respectively. In addition, the hysterectomy feature was used along with other risk factors that could influence the performance of models. The study by Chow et al. assessed the risk of breast cancer after hysterectomy and showed a statistically significant association between hysterectomy and breast cancer [ 50 ].
The use of optimization algorithms with feature weighting and proper adjustment of classification parameters could improve the performance of classification algorithms [ 51 ]. Studies reported that classifiers that used GA for feature selection performed better than those that did not. For the prediction of breast cancer, Bhattacharya et al. [ 52 ] applied three machine learning algorithms and used GA for feature selection; their findings showed that the GA improved the performance of the models created. In a study by Sakri et al. [ 53 ] to predict breast cancer recurrence in 198 instances with 34 clinical attributes, the GA was used for optimization, and the Naive Bayes accuracy, sensitivity, specificity, and area under the ROC curve were reported at 70%, 81%, 79%, and 0.82, respectively. Kumar et al. [ 54 ] used GA on a breast cancer dataset containing 611 records with 10 features to predict breast cancer survival; the reported accuracy and ROC were 88% and 0.966 for GA, showing a better performance compared to Naive Bayes, decision tree (DT), and K-nearest neighbor (KNN). In their study conducted to classify the masses observed in mammographic images, Thawkar and Ingolikar [ 55 ] used a dataset composed of 651 records with 25 mammography features. In that study, the models were optimized by GA, and the ROC, accuracy, sensitivity, and specificity were 0.974, 95%, 96.14%, and 93.94% for RF, respectively. In the studies noted above, the modeling was performed using one set of influencing factors.
Some machine-learning studies [ 56 - 62 ] reported higher accuracy (100%) and sensitivity (100%) for breast cancer prediction compared to the present study, which is likely due to the use of different databases, such as “Wisconsin” and “SEER”. Similar to the current study, some studies used databases from specific medical or research centers. Behravan and Hartikainen [ 33 ] predicted breast cancer using a database containing 695 records, including demographic risk factors and genetic data; their findings suggested that the XGBoost model with different factors showed improved performance (AUC= 0.788) compared to a model with just one set of factors (AUC= 0.678). In a study by Feld et al. [ 10 ] to predict breast cancer, the modeling was performed on 738 records, including demographic, genetic, and abnormal mammographic data, and the reported AUC was 0.75. Other studies suggest that considering different factors in modeling would improve modeling performance. For example, in a study by Ayvaci MU et al. [ 63 ], the analysis of demographic, mammography, and biopsy data using logistic regression resulted in an AUC of 0.84. Rajendran K et al. [ 64 ] analyzed 2.4 million records of mammography screening and demographic risk factors associated with breast cancer to predict breast cancer using the Naïve Bayes, RF, and C4.5 techniques; the findings indicated the highest AUC (0.993) for Naïve Bayes. The findings of a study by Atashi et al. [ 65 ], conducted on a database with 4004 records including demographic risk factors, showed the higher performance of the neural network (sensitivity= 80.9%, specificity= 99.8%, accuracy= 62.8%) compared to other approaches, such as C5.0. The study by Mosayebi et al. [ 66 ], conducted on a database with 5471 records including demographic and laboratory features, reported an accuracy of 82%, a sensitivity of 86%, and a specificity of 77% for C5.0. In a study by Jalali et al. [ 67 ] performed on 644 records (with 10 clinical features), the support vector machine (SVM) was reported with the highest sensitivity (94.33%), accuracy (93.72%), and specificity (92.26%). Afshar et al. [ 68 ] studied the survival of breast cancer patients with machine learning models using a dataset with 856 records and 15 clinical features. In this study, C5.0 showed the highest sensitivity (92.21%) and accuracy (84%). In addition, in a similar study by Nourelahi et al. [ 69 ] to predict patient survival on a database consisting of 5673 cases and 41 clinical features, logistic regression presented a sensitivity of 71.85%, specificity of 72.83%, and accuracy of 72.49%. Tapak et al. [ 70 ] performed a study on a database with 550 records to predict the survival and metastasis of breast cancer and reported a sensitivity and specificity of 99% for AdaBoost. The findings of the current study suggest that modeling with a variety of related risk factors from different sources could improve the performance of models in breast cancer prediction.
In the current study, limitations are considered as follows: modeling based on records of only one database, and the lack of access to genetic data that could influence the findings of the study. However, different machine learning approaches were used considering demographic, laboratory, and mammography features, resulting in comparing the performance of different approaches in predicting breast cancer.

Conclusion

The proposed machine-learning approaches could predict breast cancer as the early detection of this disease could help slow down the progress of the disease and reduce the mortality rate through appropriate therapeutic interventions at the right time. Applying different machine learning approaches, accessibility to bigger datasets from different institutions (multi-center study), and considering key features from a variety of relevant data sources could improve the performance of modeling.

Authors’ Contribution

R. Rabiei contributed to conceptualization and design, supervision of modeling, manuscript drafting, editing, and critical review. SM. Ayyoubzadeh performed data modeling, interpretation, and manuscript drafting. S. Sohrabei contributed to conceptualization and design, data modeling and interpretation, manuscript drafting, and editing. M. Esmaeili contributed to data interpretation and manuscript drafting. A. Atashi collected the data and contributed to manuscript drafting. All the authors read, modified, and approved the final version of the manuscript.

Ethical Approval

This study was approved by the Clinical Research Department, Breast Cancer Research Center, Motamed Cancer Institute (ACECR), Tehran, Iran, with Approval ID IR.ACECR.IBCRC.REC.1394.68.

Informed consent

We used anonymous data for modeling and no consent was required for conducting this study.

Funding

There was no funding for conducting this study.

Conflict of Interest

None

32 in total

1. Improving breast cancer risk prediction by using demographic risk factors, abnormality features on mammograms and genetic variants.

Authors: Shara I Feld; Kaitlin M Woo; Roxana Alexandridis; Yirong Wu; Jie Liu; Peggy Peissig; Adedayo A Onitilo; Jennifer Cox; C David Page; Elizabeth S Burnside
Journal: AMIA Annu Symp Proc Date: 2018-12-05

2. Comparing Mammography Abnormality Features to Genetic Variants in the Prediction of Breast Cancer in Women Recommended for Breast Biopsy.

Authors: Elizabeth S Burnside; Jie Liu; Yirong Wu; Adedayo A Onitilo; Catherine A McCarty; C David Page; Peggy L Peissig; Amy Trentham-Dietz; Terrie Kitchner; Jun Fan; Ming Yuan
Journal: Acad Radiol Date: 2015-10-26 Impact factor: 3.173

3. Bayesian network to predict breast cancer risk of mammographic microcalcifications and reduce number of benign biopsy results: initial experience.

Authors: Elizabeth S Burnside; Daniel L Rubin; Jason P Fine; Ross D Shachter; Gale A Sisney; Winifred K Leung
Journal: Radiology Date: 2006-09 Impact factor: 11.105

4. Identifying key factors for the effectiveness of pancreatic cancer screening: A model-based analysis.

Authors: Brechtje D M Koopmann; Femme Harinck; Sonja Kroep; Ingrid C A W Konings; Steffie K Naber; Iris Lansdorp-Vogelaar; Paul Fockens; Jeanin E van Hooft; Djuna L Cahen; Marjolein van Ballegooijen; Marco J Bruno; Inge M C M de Kok
Journal: Int J Cancer Date: 2021-03-25 Impact factor: 7.396