Maria C Mariani, Osei K Tweneboah, Md Al Masum Bhuiyan.
Abstract
This work analyses the diagnosis and prognosis of cancer and heart disease data using five Machine Learning (ML) algorithms. We compare the predictive ability of all the ML algorithms on breast cancer and heart disease data. The important variables that cause cancer and heart disease are also studied. We predict the test data based on the important variables and compute the prediction accuracy using the Receiver Operating Characteristic (ROC) curve. Random Forest (RF) and Principal Component Regression (PCR) provide the best performance in analyzing the breast cancer and heart disease data, respectively.
Keywords: breast cancer; heart disease; machine learning; predictive models; supervised learning
Year: 2019 PMID: 31909063 PMCID: PMC6940574 DOI: 10.3934/publichealth.2019.4.405
Source DB: PubMed Journal: AIMS Public Health ISSN: 2327-8994
Background of cancer data.
| Variables | Description |
| Id | Sample code number |
| Cl. thickness | Clump Thickness |
| Cell. size | Uniformity of Cell Size |
| Cell. shape | Uniformity of Cell Shape |
| Marg. adhesion | Marginal Adhesion |
| Epith. c. size | Single Epithelial Cell Size |
| Bare. nuclei | Bare Nuclei |
| Bl. cromatin | Bland Chromatin |
| Normal. nucleoli | Normal Nucleoli |
| Mitoses | Mitoses |
| Class | Class |
Background of heart disease data.
| Variables | Description |
| Age | Age of patient |
| sex | Sex, 1 for male |
| cp | chest pain |
| trestbps | resting blood pressure |
| chol | serum cholesterol |
| fbs | fasting blood sugar > 120 mg/dl (1 = true) |
| restecg | resting electrocardiographic result (1 = abnormality) |
| thalach | maximum heart rate achieved |
| Exang | exercise induced angina |
| Oldpeak | ST depression induced by exercise relative to rest |
| Slope | the slope of the peak exercise ST segment |
| ca | number of major vessels |
| Thal | Thalassemia |
| num | angiographic disease status |
Figure 1. Association among the cancer variables.
Figure 2. Association among the heart disease variables.
Coefficients of important predictors using LGR(L1) model for Cancer data.
| Variables | Coefficients |
| Cl.thickness | −0.4891 |
| Cell.size | 0.0000 |
| Cell.shape | −0.2656 |
| Marg.adhesion | −0.3596 |
| Epith.c.size | −0.2128 |
| Bare.nuclei | −0.2988 |
| Bl.cromatin | −0.3582 |
| Normal.nucleoli | −0.1435 |
| Mitoses | −0.30637 |
Coefficients of important predictors using LGR(L1) model for heart data.
| Variables | Coefficients |
| Age | 0.0000 |
| Sex | 0.0000 |
| Cp | 0.2847 |
| Trestbps | 0.0000 |
| Chol | 0.0000 |
| Fbs | 0.0000 |
| Restecg | 0.0000 |
| Thalach | −0.0073 |
| Exang | 0.4792 |
| Oldpeak | 0.1605 |
| Slope | 0.2172 |
| Ca | 0.4164 |
| Thal | 0.3021 |
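The zero entries in the table above reflect how an L1 penalty can drop predictors from the model entirely. As a minimal sketch in plain Python, with the coefficients copied from that table, the retained predictors can be listed as follows. The dictionary name `heart_l1_coefficients` and the helper `selected_predictors` are illustrative assumptions, not part of the original analysis.

```python
# Coefficients copied from the LGR(L1) heart-data table above.
heart_l1_coefficients = {
    "Age": 0.0000, "Sex": 0.0000, "Cp": 0.2847, "Trestbps": 0.0000,
    "Chol": 0.0000, "Fbs": 0.0000, "Restecg": 0.0000, "Thalach": -0.0073,
    "Exang": 0.4792, "Oldpeak": 0.1605, "Slope": 0.2172, "Ca": 0.4164,
    "Thal": 0.3021,
}


def selected_predictors(coefficients, tol=1e-12):
    """Return the predictors whose coefficient is not (numerically) zero.

    Hypothetical helper: an L1 penalty sets some coefficients exactly to
    zero, so the non-zero entries are the variables the model retains.
    """
    return [name for name, value in coefficients.items() if abs(value) > tol]


kept = selected_predictors(heart_l1_coefficients)
print(kept)
```

Under this reading, seven of the thirteen heart-disease predictors are retained, while the cancer table drops only Cell.size.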
Model evaluation.
| Models | Cancer Data PMSE | Cancer Data MCR | Heart-disease Data PMSE | Heart-disease Data MCR |
| LGR-L_1 | 0.0306 | 0.0284 | 0.1512 | 0.1720 |
| LGR-L_2 | 0.0349 | 0.0349 | 0.1491 | 0.1935 |
| PCR | 0.0384 | 0.0349 | 0.1078 | 0.1182 |
| RF | 0.0205 | 0.0262 | 0.1447 | 0.1720 |
| MARS | 0.0203 | 0.0305 | 0.1588 | 0.2043 |
| SVM-linear | 0.0305 | 0.0305 | 0.1935 | 0.1935 |
| SVM-nonlinear | 0.0219 | 0.0305 | 0.1445 | 0.1627 |
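The record does not define PMSE or MCR. The sketch below assumes that PMSE is the mean squared error between predicted probabilities and the 0/1 test labels, and that MCR is the misclassification rate at a 0.5 threshold. Both definitions, the function names, and the toy data are assumptions made for illustration.

```python
def pmse(y_true, y_prob):
    """Assumed definition: mean squared error of predicted probabilities."""
    if not y_true or len(y_true) != len(y_prob):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def mcr(y_true, y_prob, threshold=0.5):
    """Assumed definition: share of cases misclassified at the threshold."""
    if not y_true or len(y_true) != len(y_prob):
        raise ValueError("inputs must be non-empty and of equal length")
    wrong = sum(int(p >= threshold) != y for y, p in zip(y_true, y_prob))
    return wrong / len(y_true)


# Toy data (not from the paper): four test cases with predicted probabilities.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.6, 0.8, 0.4]

print("PMSE:", pmse(y_true, y_prob))
print("MCR:", mcr(y_true, y_prob))
```

Under these assumed definitions the two metrics coincide whenever a model outputs hard 0/1 labels rather than probabilities, which would be consistent with the identical PMSE and MCR values in the SVM-linear row.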
Figure 3. Variable importance plot using random forest model for cancer data.
Figure 4. Variable importance plot using random forest model for heart-disease data.
Figure 5. Variable importance plot using MARS model for cancer data.
Figure 6. Variable importance plot using MARS model for heart disease data.
Prediction accuracy for cancer data.
| Models | Sensitivity (%) | Specificity (%) | Accuracy (%) | Conf. Interval (%) |
| LGR-L_1 | 92.19 | 98.94 | 96.20 | (91.92–98.59) |
| LGR-L_2 | 83.56 | 99.36 | 94.32 | (90.49–96.49) |
| PCR | 80.82 | 100.00 | 93.89 | (89.96–96.62) |
| RF | 97.26 | 97.44 | 97.38 | (94.38–99.03) |
| MARS | 94.52 | 98.08 | 96.94 | (93.80–98.76) |
| SVM-linear | 94.54 | 98.06 | 96.94 | (93.80–98.76) |
| SVM-nonlinear | 93.42 | 98.69 | 96.94 | (93.81–98.76) |
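Sensitivity, specificity, and accuracy in the table above all derive from a 2×2 confusion matrix. The record does not report the confusion matrices themselves, so the counts in this sketch are an inference: they were chosen because they reproduce the RF row, and should be read as hypothetical. The function name is likewise illustrative.

```python
def classification_rates(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) as percentages."""
    sensitivity = tp / (tp + fn)   # true positives among actual positives
    specificity = tn / (tn + fp)   # true negatives among actual negatives
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return tuple(round(100 * rate, 2) for rate in (sensitivity, specificity, accuracy))


# Hypothetical counts, inferred so that they match the RF row of the table;
# the source does not state them.
rf_rates = classification_rates(tp=71, fn=2, tn=152, fp=4)
print(rf_rates)
```

With these counts the three percentages come out as 97.26, 97.44, and 97.38, matching the RF row. The confidence interval column would need an additional interval method, which the record does not name.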
Figure 7. Model evaluation using ROC curve for cancer data.
Figure 8. Model evaluation using ROC curve for heart disease data.
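The ROC curves in Figures 7 and 8 plot the true positive rate against the false positive rate as the decision threshold varies. The following is a minimal sketch of that construction in plain Python, with the area under the curve computed by the trapezoidal rule. The function names and the toy scores are assumptions, and this is not the authors' code.

```python
def roc_points(y_true, scores):
    """Return ROC points (false positive rate, true positive rate).

    The threshold is swept from the highest score downwards; tied scores
    are processed together so that each threshold gives a single point.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        current = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == current:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy data (not from the paper).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
curve = roc_points(y_true, scores)
print(curve, auc(curve))
```

An area of 1.0 corresponds to perfect ranking of positives above negatives, and 0.5 to ranking no better than chance.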
Prediction accuracy for heart disease data.
| Models | Sensitivity (%) | Specificity (%) | Accuracy (%) | Conf. Interval (%) |
| LGR-L_1 | 77.36 | 90.00 | 82.20 | (73.57–89.83) |
| LGR-L_2 | 91.11 | 72.92 | 81.72 | (72.35–88.92) |
| PCR | 95.56 | 81.25 | 88.17 | (79.82–93.95) |
| RF | 78.43 | 88.10 | 82.80 | (73.57–89.83) |
| MARS | 84.44 | 75.00 | 79.57 | (69.95–87.23) |
| SVM-linear | 77.55 | 84.09 | 80.65 | (71.15–88.11) |
| SVM-nonlinear | 78.00 | 86.05 | 81.72 | (72.35–88.98) |