
Prediction of Breast Cancer using Machine Learning Approaches.

Reza Rabiei, Seyed Mohammad Ayyoubzadeh, Solmaz Sohrabei, Marzieh Esmaeili, Alireza Atashi.

Abstract

Background: Breast cancer is one of the most common cancers in women and is caused by various clinical, lifestyle, social, and economic factors. Machine learning has the potential to predict breast cancer based on features hidden in data. Objective: This study aimed to predict breast cancer using different machine-learning approaches applied to demographic, laboratory, and mammographic data.
Material and Methods: In this analytical study, a database of 5,178 independent records, 25% of which belonged to breast cancer patients, with 24 attributes in each record, was obtained from Motamed Cancer Institute (ACECR), Tehran, Iran. The random forest (RF), neural network (multi-layer perceptron, MLP), gradient boosting trees (GBT), and genetic algorithm (GA) were used in this study. Models were initially trained with demographic and laboratory features (20 features). The models were then trained with all demographic, laboratory, and mammographic features (24 features) to measure the effectiveness of mammography features in predicting breast cancer.
Results: RF presented higher performance compared to other techniques (accuracy 80%, sensitivity 95%, specificity 80%, and the area under the curve (AUC) 0.56). Gradient boosting (AUC=0.59) showed a stronger performance compared to the neural network.
Conclusion: Combining multiple risk factors in modeling for breast cancer prediction could help the early diagnosis of the disease with necessary care plans. Collection, storage, and management of different data and intelligent systems based on multiple factors for predicting breast cancer are effective in disease management. Copyright: © Journal of Biomedical Physics and Engineering.

Keywords:  Artificial Intelligence; Breast Cancer; Computing Methodologies; Genetic Algorithm; Machine Learning

Year:  2022        PMID: 35698545      PMCID: PMC9175124          DOI: 10.31661/jbpe.v0i0.2109-1403

Source DB:  PubMed          Journal:  J Biomed Phys Eng        ISSN: 2251-7200


Introduction

Breast cancer is considered a multifactorial disease and the most common cancer in women worldwide [ 1 , 2 ], accounting for approximately 30% of all female cancers [ 3 , 4 ]; about 1.5 million women are diagnosed with breast cancer each year, and 500,000 women die from the disease worldwide. Over the past 30 years, the incidence of this disease has increased, while the death rate has decreased. The reduction in mortality attributed to mammography screening is estimated at 20%, and that attributed to improvements in cancer treatment at 60% [ 5 , 6 ]. Diagnostic mammography can assess abnormal breast tissue in patients with subtle and inconspicuous signs of malignancy. However, because of the large number of images, this method cannot be used effectively to assess areas suspected of cancer. According to one report, approximately 50% of breast cancers were not detected in screenings of women with very dense breast tissue [ 7 ]. In addition, about a quarter of women with breast cancer are diagnosed within two years of a negative screening result. Therefore, the early and timely diagnosis of breast cancer is crucial [ 8 ]. Most mammography-based breast cancer screening is performed at regular intervals, usually annually or every two years, for all women. This fixed, one-size-fits-all screening program is not effective in diagnosing cancer at the individual level and may impair the effectiveness of screening programs [ 9 ]. On the other hand, experts suggest that considering other risk factors along with mammography screening can support a more accurate identification of women at risk [ 9 - 11 ]. Moreover, effective risk prediction through modeling can not only help radiologists set up personalized screening for patients and encourage them to participate in the program for early detection but also help identify high-risk patients [ 12 , 13 ].
Machine learning, as a modeling approach, represents the process of extracting knowledge from data and discovering hidden relationships [ 14 ], and has been widely used in healthcare in recent years [ 15 ] to predict different diseases [ 16 - 18 ]. Some studies used only demographic risk factors (lifestyle and laboratory data) in predicting breast cancer [ 19 , 20 ], and several studies based their predictions on mammographic images [ 21 ] or used data from patient biopsies [ 22 ]. Others showed the application of genetic data in predicting breast cancer [ 23 ]. A major challenge in predicting breast cancer is the creation of a model that addresses all known risk factors [ 24 - 26 ]. Current prediction models might focus only on the analysis of mammographic images or demographic risk factors, without other critical factors. In addition, these models, which are not accurate enough for identifying high-risk women, could result in multiple screenings and invasive sampling with magnetic resonance imaging (MRI) and ultrasound, imposing a financial and psychological burden on patients [ 27 - 29 ]. The effective prediction of breast cancer risk requires different factors, including demographic, laboratory, and mammographic risk factors [ 24 , 25 , 30 , 31 ]. Therefore, multifactorial models that include many risk factors can be effective in assessing the risk of breast cancer through more accurate analysis [ 32 , 33 ]. The current study aimed to predict breast cancer using different machine learning approaches, considering various factors in modeling.

Material and Methods

In this analytical study, the database was obtained from a clinical breast cancer research center (Motamed cancer institute) in Tehran, Iran. The research was conducted in 4 stages: data collection, data pre-processing, modeling, and model evaluation.

Data Collection

In the first stage, 5178 records of people, referred to the research center over the past 10 years (2011-2021), were prepared retrospectively. Each record covered 24 features (11 demographic features, 9 laboratory features, and 4 mammography features) (Table 1), all labeled to indicate the presence or absence of breast cancer, of which 1,295 records (25%) were identified as breast cancer.
Table 1

The relevant features of breast cancer

Feature name | Description | Type | Values
Age | Age at diagnosis | Demographic | <100 years
Age.menop | Age at menopause | Demographic | 38-65 years
First pregnancy | Age at first pregnancy | Demographic | 13-42 years
Age.menarch | Age at menarche | Demographic | 11-18 years
BMI | Body mass index | Demographic | Underweight (below 18.5)=0, Normal (18.5-24.9)=1, Overweight (25.0-29.9)=2, Obese (30.0 and above)=3
Lactation | Breastfeeding status | Demographic | 0-96 months
Physical Activity | Regular physical activity | Demographic | Yes=1, No=0
Education | Academic education | Demographic | Illiterate=1, primary=2, high school=3, university=4
Life event | Stressful life event status | Demographic | No=0, death of father=1, family problems=2, death of mother=3, death of child=4, death of husband=5, divorced=6
Smoking | Smoking status | Demographic | Yes=1, No=0
Marital | Marital status | Demographic | Single=0, other=1
Duration Ocp.used | Months of oral contraceptive pill (OCP) use | Laboratory | 0-120 months
Duration HRT used | Months of hormone replacement therapy (HRT) use | Laboratory | 0-120 months
Personal. Other. Cancer | Personal history of other cancer | Laboratory | No=0, ovary=1, endometrium=2, colon=3, meningioma=4, lymphoma=5
Family.BC | Family history of breast cancer | Laboratory | Yes=1, No=0
Exposure X-ray | X-ray exposure to the chest | Laboratory | Negative=0, positive=1
Vitamin D3 | Amount of vitamin D in the body | Laboratory | <10 mg=0 (deficiency), 10-30 mg=1 (insufficiency), 30-100 mg=2 (sufficient), >100 mg=3 (overdose)
Biopsy | Pathology of biopsy | Laboratory | No malignancy detected=0, lobular carcinoma in situ=1, ductal carcinoma in situ=2, ductal carcinoma in situ=3, invasive lobular carcinoma=4, medullary=5, microinvasion=6
Hysterectomy | History of hysterectomy | Laboratory | Yes=1, No=0
Personal.BC | Personal breast cancer history | Laboratory | Yes=1, No=0, surgery=2, RT (radiotherapy)=3
Breast density | Screening | Mammography | Fatty tissue=0, glandular and fibrous tissue=1, dense=2, heterogeneously dense/extremely dense=3
Micro lobulated | Screening | Mammography | None=0, fibroadenoma=1, papilloma=2, phyllodes tumor=3, DCIS=4, IDC=5, ILC=6, lactating and tubular adenomas=7
Circumscribed | Screening | Mammography | None=0, cysts=1, complicated cyst=2, clustered microcyst=3, solid mass=4
Micro calcification, Macro calcification | Screening | Mammography | Probably benign (punctate)=1, intermediate concern (coarse heterogeneous, amorphous)=2, higher probability of malignancy (fine pleomorphic, fine linear/branching)=3
Class | Breast cancer | | Malignant=1, benign=0

DCIS: Ductal carcinoma in situ, IDC: Invasive ductal carcinoma, ILC: Invasive lobular carcinoma


Data preprocessing

The second step was data preprocessing, in which five records related to men were removed, leaving a total of 1290 records. Some laboratory values that fell outside the expected range were corrected against the central registry, where the patients’ laboratory results were available. In addition, missing values were imputed with the most frequent value (the mode) of the corresponding feature. Finally, the Synthetic Minority Oversampling Technique (SMOTE) was used to balance the training data because the two study classes differed in their number of records.
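The two preprocessing operations named above, mode imputation and SMOTE, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the helper names `impute_mode` and `smote`, the toy rows, and the neighbour count `k=3` are all hypothetical, and the SMOTE function is a simplified version that interpolates numeric features only.

```python
import random
from collections import Counter


def impute_mode(rows):
    """Replace None entries in each column with that column's most frequent value."""
    n_cols = len(rows[0])
    modes = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        modes.append(Counter(observed).most_common(1)[0][0])
    return [[modes[j] if r[j] is None else r[j] for j in range(n_cols)] for r in rows]


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy data (hypothetical): None marks a missing value.
rows = [[1, 0], [1, None], [None, 1], [1, 1], [0, 1]]
filled = impute_mode(rows)

minority = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [2.0, 1.0]]
new_points = smote(minority, n_new=6)
```

Because every synthetic point lies on a segment between two existing minority samples, oversampling of this kind is normally applied to the training portion only, so that no interpolated record leaks into the test set.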

Modeling for breast cancer prediction

In the third step, Python with the Scikit-Learn 0.18.2 and NumPy v1.20 libraries and TPOT was used for modeling. Three learners, i.e., random forest (RF), gradient boosting trees (GBT), and multi-layer perceptron (MLP), were applied to the dataset. K-fold cross-validation (K=3) was used to obtain the optimized hyper-parameters of each model in the genetic algorithm step. In the final evaluation, a train-test split (75% for training and 25% for testing) was used to estimate the performance of the models more accurately. A genetic algorithm (GA) with a population of 5, 50 offspring, and 10 generations, with the highest accuracy as the model selection criterion, was used to optimize the hyper-parameter values. The models were first trained with the demographic and laboratory features (20 features) and then with all demographic, laboratory, and mammography features (24 features) to measure the effect of the mammography features on predicting breast cancer.
For the MLP, the number of hidden layers was set to 10, and the alpha value for the training rate ranged from 0.01 to 0.2. The sigmoid and hyperbolic tangent functions were considered as activation functions, and gradient-based solvers, such as Adam and stochastic gradient descent (SGD), were used to find the optimal weights. For the GBT model, the learning rate ranged from 0.01 to 0.2, and the maximum depth was set to 3, 5, and 8. The buoyancy level learning was 0.1 and the number of estimators for gradient boosting was 10. For the RF model, the minimum number of samples required at a leaf (external) node was set to 4 and 12, the number of estimators was 151, and the minimum number of samples required to split a node (min_samples_split) was set to 5 and 10. The block diagram for the methods is shown in Figure 1.
Figure 1

Block diagram of methods
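The validation scheme described above (a 75%/25% train-test split for the final evaluation and 3-fold cross-validation during hyper-parameter search) can be sketched in plain Python. The function names and the toy label vector are hypothetical, and stratifying the split by class is an assumption; the text does not state whether the split was stratified.

```python
import random


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx), keeping class proportions in both parts."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


def k_fold(indices, k=3):
    """Yield (fit_idx, validation_idx) pairs; each index is validated exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        fit_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield fit_idx, folds[i]


# Toy labels (hypothetical): 25% positive, mirroring the class ratio in the data.
labels = [1] * 25 + [0] * 75
train_idx, test_idx = stratified_split(labels)
folds = list(k_fold(train_idx, k=3))
```

Only the training indices are passed to `k_fold`, so the held-out 25% plays no part in choosing hyper-parameters.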


Random Forest (RF)

As a non-parametric approach, the RF is used here for classification. It builds a large number of decision trees and classifies each record at high speed [ 34 ]. Each tree is built from a random subset of the input variables, and the predictions of all the trees are then combined for a better inference [ 35 ].
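The idea of combining many randomized trees can be illustrated with a deliberately reduced sketch: each "tree" below is a single-split stump trained on a bootstrap sample and one randomly chosen feature, and the forest predicts by majority vote. The function names and toy data are hypothetical, and real random forests grow deeper trees and draw a fresh feature subset at every split.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature):
    """Single-split tree on one feature, chosen to minimise training errors."""
    values = sorted(set(r[feature] for r in rows))
    majority = Counter(labels).most_common(1)[0][0]
    best = (len(labels) + 1, None, majority, majority)
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
        right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    if threshold is None:  # the feature is constant in this sample
        return lambda row: majority
    return lambda row: left_label if row[feature] <= threshold else right_label


def fit_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    trees = []
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        feature = rng.randrange(n_features)            # random input variable
        trees.append(fit_stump([rows[i] for i in sample],
                               [labels[i] for i in sample], feature))
    return trees


def predict(trees, row):
    """Majority vote over the individual trees."""
    return Counter(tree(row) for tree in trees).most_common(1)[0][0]


# Toy data (hypothetical): two well-separated classes with two features.
rows = ([[0.1 * i, 0.2 * i] for i in range(10)]
        + [[10 + 0.1 * i, 20 + 0.2 * i] for i in range(10)])
labels = [0] * 10 + [1] * 10
trees = fit_forest(rows, labels)
```

Calling `predict(trees, [0.5, 0.5])` returns the vote of all 25 stumps for a new record.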

Gradient Boosting Trees (GBT)

This algorithm belongs to the family of gradient boosting algorithms and performs very well in classification [ 36 ]. In this method, the trees are trained one after another; each subsequent tree is trained primarily on the data predicted erroneously by the previous trees. This process continuously reduces the model error since each model sequentially improves on the weaknesses of the previous one [ 37 , 38 ].
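The sequential error-correcting behaviour described above can be sketched with one-split regression stumps fitted to residuals. This is a simplified illustration under assumptions: it uses squared error on a single numeric feature, whereas gradient boosting for classification typically optimizes a log-loss; the function names and toy data are hypothetical. The defaults `n_rounds=10` and `learning_rate=0.1` echo values quoted in the text.

```python
def fit_stump(xs, residuals):
    """Best single-split regressor on 1-D inputs under squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, threshold, lv, rv)
    _, threshold, lv, rv = best
    return lambda x: lv if x <= threshold else rv


def boost(xs, ys, n_rounds=10, learning_rate=0.1):
    """Fit stumps one after another, each on the residuals left by the ensemble so far."""
    base = sum(ys) / len(ys)
    stumps = []
    history = []  # training mean squared error after each round

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    for _ in range(n_rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
        history.append(sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs))
    return predict, history


# Toy data (hypothetical): one noisy step from 0 to 1.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
predict, history = boost(xs, ys)
```

The `history` list records the training error after each round; with least-squares stumps and a learning rate of at most 1, it cannot increase from one round to the next.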

Multi-Layer Perceptron (MLP)

As a deep artificial neural network, the MLP is composed of an input layer that receives the signal, an output layer used for prediction, and, between those two, some hidden layers that act as the computation engine. The MLP is a supervised network trained by the backpropagation algorithm. In this network, data flow from the input nodes to the output nodes. If there is an error in the output, it is propagated back from the output toward the input and used to correct the weights; backpropagation is the most commonly used method for this [ 39 , 40 ].
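The forward pass and the backward propagation of the output error can be sketched for a network with one tanh hidden layer and a sigmoid output, followed by a single plain gradient-descent update. The weights, the input, and the function names are hypothetical, biases are omitted for brevity, and the update shown is ordinary gradient descent rather than Adam.

```python
import math


def forward(x, w_hidden, w_out):
    """One hidden tanh layer followed by a sigmoid output unit (biases omitted)."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    z = sum(w * h for w, h in zip(w_out, hidden))
    return hidden, 1.0 / (1.0 + math.exp(-z))


def loss(x, y, w_hidden, w_out):
    """Binary cross-entropy for one example."""
    _, p = forward(x, w_hidden, w_out)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def backprop(x, y, w_hidden, w_out):
    """Gradients of the cross-entropy loss, propagated from output to input."""
    hidden, p = forward(x, w_hidden, w_out)
    delta_out = p - y  # derivative of the loss w.r.t. the output pre-activation
    grad_out = [delta_out * h for h in hidden]
    # Error sent back through each hidden unit (chain rule through tanh).
    delta_hidden = [delta_out * w * (1 - h * h) for w, h in zip(w_out, hidden)]
    grad_hidden = [[d * xi for xi in x] for d in delta_hidden]
    return grad_hidden, grad_out


def train_step(x, y, w_hidden, w_out, lr=0.1):
    """One gradient-descent update of all weights."""
    grad_hidden, grad_out = backprop(x, y, w_hidden, w_out)
    new_hidden = [[w - lr * g for w, g in zip(wr, gr)]
                  for wr, gr in zip(w_hidden, grad_hidden)]
    new_out = [w - lr * g for w, g in zip(w_out, grad_out)]
    return new_hidden, new_out


# Toy example (hypothetical values): two inputs, two hidden units.
x, y = [1.0, 0.5], 1
w_hidden = [[0.2, -0.2], [0.7, 0.1]]
w_out = [0.5, -0.3]
before = loss(x, y, w_hidden, w_out)
w_hidden2, w_out2 = train_step(x, y, w_hidden, w_out)
after = loss(x, y, w_hidden2, w_out2)
```

Comparing `before` and `after` shows the effect of one weight correction on the loss for this example.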

Genetic Algorithm (GA)

As a subset of evolutionary computing, the GA is directly associated with artificial intelligence and is used for solving optimization problems through an evolutionary process [ 41 , 42 ]. To obtain the best answer, the GA applies the survival-of-the-fittest rule to a population of candidate solutions [ 43 , 44 ]. In each generation, following a natural biological process, the best chromosomes are selected to create the subsequent generation, moving toward an optimal solution of the problem [ 45 ].
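A genetic search over hyper-parameters can be sketched as below, using the population size (5), offspring count (50), and number of generations (10) quoted in the modeling section. Everything else is an assumption made for illustration: the search space values, the crossover and mutation operators, the elitist selection rule, and in particular `toy_fitness`, which is a hypothetical stand-in for the cross-validated accuracy that a real search would compute.

```python
import random

# Search space loosely following the ranges quoted in the text for GBT (assumed grid).
SPACE = {
    "learning_rate": [0.01, 0.05, 0.1, 0.15, 0.2],
    "max_depth": [3, 5, 8],
}


def toy_fitness(params):
    """Hypothetical stand-in for cross-validated accuracy (higher is better)."""
    return 1.0 - abs(params["learning_rate"] - 0.1) - 0.01 * abs(params["max_depth"] - 5)


def random_individual(rng):
    return {name: rng.choice(values) for name, values in SPACE.items()}


def crossover(a, b, rng):
    # Each gene (hyper-parameter) is inherited from one parent at random.
    return {name: rng.choice([a[name], b[name]]) for name in SPACE}


def mutate(individual, rng, rate=0.2):
    # With probability `rate`, resample a gene from its allowed values.
    return {
        name: rng.choice(SPACE[name]) if rng.random() < rate else value
        for name, value in individual.items()
    }


def genetic_search(fitness, population_size=5, offspring=50, generations=10, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(population_size)]
    best_per_generation = []
    for _ in range(generations):
        children = [
            mutate(crossover(*rng.sample(population, 2), rng), rng)
            for _ in range(offspring)
        ]
        # Elitist selection: parents compete with children, the fittest survive.
        population = sorted(population + children, key=fitness, reverse=True)[:population_size]
        best_per_generation.append(fitness(population[0]))
    return population[0], best_per_generation


best, history = genetic_search(toy_fitness)
```

Because parents are kept in the selection pool, the best fitness in `history` never decreases from one generation to the next.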

Model Evaluation

The test results of the database samples are summarized as a confusion matrix, whose layout is shown in Table 2. In the final stage, the performance of the created models was measured by different criteria. How well the samples are classified is one of the common criteria for evaluating classifiers, reflecting their accuracy and their ability to separate the classes [ 46 ]. In this study, accuracy, sensitivity, specificity, and the area under the receiver operating characteristic (ROC) curve were used to measure the overall performance of the classifiers.
Table 2

Confusion matrix of a binominal classifier

 | Predicted Negative | Predicted Positive
Actual Negative | TN | FP
Actual Positive | FN | TP

TN: True Negative, FN: False Negative, FP: False Positive, TP: True Positive
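The three threshold-based criteria follow directly from the four cells of the confusion matrix. The sketch below uses the standard definitions; the function name and the counts are illustrative only and are not taken from the study.

```python
def classification_metrics(tn, fp, fn, tp):
    """Accuracy, sensitivity and specificity from the four confusion-matrix cells."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # true-positive rate (recall)
        "specificity": tn / (tn + fp),  # true-negative rate
    }


# Illustrative counts only, not taken from the study.
m = classification_metrics(tn=80, fp=20, fn=5, tp=95)
```

With these counts, sensitivity is 95/100 = 0.95, specificity is 80/100 = 0.80, and accuracy is 175/200 = 0.875.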


Results

A total of 1290 records containing 24 demographic, laboratory, and mammographic features related to breast cancer were used in the study. The weights of the features, based on their degree of importance, are shown in Figure 2 (the weights range from 0.0 to 1). Family history of breast cancer, personal history of breast cancer, breast density, and age at diagnosis were the most important factors in the diagnosis of this disease.
Figure 2

The weight of the features in breast cancer prediction

Based on the area under the ROC curve, gradient boosting trees (GBT) was the model with the highest performance. The modeling results using RF, GBT, and MLP are shown in Table 3, and their ROC curves are compared in Figure 3 and Table 4.
Table 3

Performance comparison of the breast cancer prediction models

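Model | Features | AUC | Sensitivity (%) | Specificity (%) | Accuracy (%)
Random Forest | Demographics | 0.53 | 93 | 83 | 79
Random Forest | Demographics + Mammography | 0.53 | 95 | 83 | 80
Gradient Boosting | Demographics | 0.59 | 63 | 87 | 62
Gradient Boosting | Demographics + Mammography | 0.59 | 82 | 86 | 74
Multi-Layer Perceptron | Demographics | 0.56 | 78 | 85 | 71
Multi-Layer Perceptron | Demographics + Mammography | 0.56 | 82 | 84 | 73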

AUC: Area under the ROC curve, ROC: Receiver operating characteristic

Figure 3

Receiver operating characteristic (ROC) curve of models

Table 4

Area under the Receiver operating characteristic (ROC) curve

Test Result Model(s) | Area
GBT | 0.59
MLP | 0.56
RF | 0.53

GBT: Gradient Boosting Tree, MLP: Multi-Layer-Perceptron, RF: Random Forest
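The area under the ROC curve can be computed without drawing the curve, as the probability that a randomly chosen positive record receives a higher score than a randomly chosen negative one. The sketch below implements that rank-based definition; the function name and the four example scores are hypothetical.

```python
def roc_auc(labels, scores):
    """Probability that a random positive scores higher than a random negative
    (ties count one half); this equals the area under the ROC curve."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative scores only: three of the four positive/negative pairs are ranked correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)  # 3 / 4 = 0.75
```

Under this definition, a value of 0.5 corresponds to a ranking no better than chance and 1.0 to a perfect ranking.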


Discussion

According to the findings of the current study, mammographic features, combined with other features, could improve the performance of the models. The RF model showed the highest sensitivity (95%) and, given the importance of sensitivity in breast cancer diagnosis, was considered more efficient, although models such as gradient boosting showed higher specificity (86%). In studies by Rosner et al. [ 47 , 48 ], family and personal history of breast cancer were two of the key influential factors in breast cancer, which is consistent with the findings of the current study, as these two factors had the highest weights (0.92 and 0.89) compared to other factors. Breast density and age, each with a weight of 0.80, are influential in tumor appearance and increase the proportion of breast cancers [ 49 ]. The hysterectomy feature was also used along with other risk factors that could influence the performance of the models; the study by Chow et al. assessed the risk of breast cancer after hysterectomy and showed a statistically significant association between hysterectomy and breast cancer [ 50 ].
The use of optimization algorithms with feature weighting and proper adjustment of classification parameters could improve the performance of classification algorithms [ 51 ]. Studies reported that classifiers that used the GA in feature selection performed better than those that did not. For the prediction of breast cancer, Bhattacharya et al. [ 52 ] applied three machine learning algorithms and used the GA for feature selection; their findings showed that the GA improved the performance of the models created. In a study by Sakri et al. [ 53 ] to predict breast cancer recurrence in 198 instances with 34 clinical attributes, the GA was used for optimization; the accuracy, sensitivity, specificity, and area under the ROC curve of Naive Bayes were reported at 70%, 81%, 79%, and 0.82, respectively. Kumar et al. [ 54 ] used the GA on a breast cancer dataset containing 611 records with 10 features to predict breast cancer survival; the reported accuracy and area under the ROC curve were 88% and 0.966 for the GA, a better performance than Naive Bayes, decision tree (DT), and K-nearest neighbor (KNN). In their study conducted to classify the masses observed in mammographic images, Thawkar and Ingolikar [ 55 ] used a dataset composed of 651 records with 25 mammography features. In that study, the models were optimized by the GA, and the area under the ROC curve, accuracy, sensitivity, and specificity were 0.974, 95%, 96.14%, and 93.94% for RF, respectively. In the studies noted above, the modeling was performed using one set of influencing factors. Some machine-learning studies [ 56 - 62 ] reported higher accuracy (100%) and sensitivity (100%) for breast cancer prediction compared to the present study, which is likely due to the use of different databases, such as “Wisconsin” and “SEER”.
Similar to the current study, some studies used databases from specific medical or research centers. Behravan and Hartikainen [ 33 ] predicted breast cancer using a database containing 695 records, including demographic risk factors and genetic data; their findings suggested that the XGBoost model with different factors showed improved performance (AUC = 0.788) compared to a model with just one set of factors (AUC = 0.678). In a study by Feld et al. [ 10 ] to predict breast cancer, the modeling was performed on 738 records, including demographic, genetic, and abnormal mammographic data, and the reported AUC was 0.75. Other studies suggest that considering different factors in modeling would improve modeling performance. For example, in a study by Ayvaci et al. [ 63 ], the analysis of demographic, mammography, and biopsy data using logistic regression resulted in an AUC of 0.84. Rajendran et al. [ 64 ] analyzed 2.4 million records of mammography screening and demographic risk factors associated with breast cancer to predict breast cancer using the Naive Bayes, RF, and C4.5 techniques; the findings indicated the highest AUC (0.993) for Naive Bayes.
The findings of a study by Atashi et al. [ 65 ], conducted on a database with 4004 records including demographic risk factors, showed the higher performance of the neural network (sensitivity = 80.9%, specificity = 99.8%, accuracy = 62.8%) compared to other approaches, such as C5.0. The study by Mosayebi et al. [ 66 ], conducted on a database with 5471 records including demographic and laboratory features, reported an accuracy of 82%, a sensitivity of 86%, and a specificity of 77% for C5.0. In a study by Jalali et al. [ 67 ] performed on 644 records (with 10 clinical features), the support vector machine (SVM) was reported with the highest sensitivity (94.33%), accuracy (93.72%), and specificity (92.26%). Afshar et al. [ 68 ] studied the survival of breast cancer patients with machine learning models, using a dataset with 856 records and 15 clinical features; in that study, C5.0 showed the highest sensitivity (92.21%) and accuracy (84%). In a similar study by Nourelahi et al. [ 69 ] to predict patient survival on a database consisting of 5673 cases and 41 clinical features, logistic regression presented a sensitivity of 71.85%, a specificity of 72.83%, and an accuracy of 72.49%. In addition, Tapak et al. [ 70 ] performed a study on a database with 550 records to predict the survival and metastasis of breast cancer and reported a sensitivity and specificity of 99% for AdaBoost. The findings of the current study suggest that modeling with a variety of related risk factors from different sources could improve the performance of models in breast cancer prediction.
The limitations of the current study are as follows: the modeling was based on records from only one database, and genetic data, which could influence the findings of the study, were not accessible. However, different machine learning approaches were applied to demographic, laboratory, and mammography features, which allowed the performance of these approaches in predicting breast cancer to be compared.

Conclusion

The proposed machine-learning approaches could predict breast cancer; early detection of this disease could help slow its progress and reduce the mortality rate through appropriate and timely therapeutic interventions. Applying different machine learning approaches, gaining access to bigger datasets from different institutions (multi-center studies), and considering key features from a variety of relevant data sources could improve the performance of modeling.

Authors’ Contribution

R. Rabiei proposed conceptualization and design, supervision of modeling, manuscript drafting, editing, and critical review. Data modeling, interpretation, and manuscript drafting were done by SM. Ayyoubzadeh. S. Sohrabei provided conceptualization and design, data modeling and interpretation, manuscript drafting, and editing. M. Esmaeili presented data interpretation and manuscript drafting. A. Atashi collected data and contributed to manuscript drafting. All the authors read, modified, and approved the final version of the manuscript.

Ethical Approval

This study was approved by the Clinical Research Department, Breast Cancer Research Center, Motamed Cancer Institute (ACECR), Tehran, Iran, with Approval ID IR.ACECR.IBCRC.REC.1394.68.

Informed consent

We used anonymous data for modeling and no consent was required for conducting this study.

Funding

There was no funding for conducting this study.

Conflict of Interest

None