L J Muhammad, Ibrahem Al-Shourbaji, Ahmed Abba Haruna, I A Mohammed, Abdulkadir Ahmad, Muhammed Besiru Jibrin.
Abstract
Coronary artery disease (CAD) is the commonest type of heart disease, and over 80% of deaths from the disease occur in developing countries, including Nigeria, with the majority among people below 70 years of age. Although CAD is not a well-known disease in Nigeria, 2.82% of all deaths in the country in 2014 were due to it. In this study, machine learning predictive models for CAD were developed using a diagnostic CAD dataset obtained from two General Hospitals in Kano State, Nigeria. The dataset was applied to six machine learning algorithms (support vector machine, K-nearest neighbor, random forest, Naïve Bayes, gradient boosting and logistic regression) to build the predictive models, which were evaluated on accuracy, specificity, sensitivity and the receiver operating characteristic (ROC) curve. In terms of accuracy, the random forest model was the best with 92.04%; for specificity, the Naïve Bayes model was the best with 92.40%; for sensitivity, the support vector machine model was the best with 87.34%; and for ROC, the random forest model was again the best with 92.20%. The decision tree generated with the random forest algorithm, the best model in terms of accuracy and ROC, can be converted into production rules and used to develop an expert system for the diagnosis of CAD patients in Nigeria.
Keywords: CAD; Disease; Machine learning; Predictive model
Year: 2021 PMID: 34179828 PMCID: PMC8218284 DOI: 10.1007/s42979-021-00731-4
Source DB: PubMed Journal: SN Comput Sci ISSN: 2661-8907
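The abstract evaluates the models by accuracy, specificity and sensitivity. As a minimal sketch of how these three measures follow from a binary confusion matrix, the Python snippet below assumes labels coded 1 for CAD-positive and 0 for CAD-negative; the function name and the example labels are illustrative assumptions and are not taken from the study.

```python
def confusion_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    nan = float("nan")
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else nan,
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,  # true positive rate
        "specificity": tn / (tn + fp) if (tn + fp) else nan,  # true negative rate
    }


# Made-up labels, for illustration only (not data from the study).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
metrics = confusion_metrics(y_true, y_pred)
print(metrics)
```

Sensitivity is the share of true CAD cases the model flags, while specificity is the share of non-cases it correctly clears, which is why different models can lead on different measures.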
Fig. 1 Study methods and materials
Description of the dataset features
| SN | Feature | Units or coding | Range |
|---|---|---|---|
| 1 | Age | Years | 1–150 |
| 2 | Sex | Male (1), female (0) | 0, 1 |
| 3 | Family history | Yes (1), no (0) | 0, 1 |
| 4 | Smoking | Yes (1), no (0) | 0, 1 |
| 5 | Diabetes | Yes (1), no (0) | 0, 1 |
| 6 | Hypertension | Yes (1), no (0) | 0, 1 |
| 7 | Hyperlipidemia | Yes (1), no (0) | 0, 1 |
| 8 | Blood pressure | mmHg | 90–190 |
| 9 | Glucose | mg/dL | 37–295 |
| 10 | Cholesterol | mg/dL | 128–575 |
| 11 | Triglyceride | mg/dL | 40–690 |
| 12 | HDL | mg/dL | 10.6–73 |
| 13 | LDL | mg/dL | 10–220 |
| 14 | Creatinine | mg/dL | 0.6–3.3 |
| 15 | Body mass index | kg/m² | 20.28–40.25 |
| 16 | Heart rate | Bpm | 42–124 |
| 17 | Chest pain | Typical angina (4), atypical angina (3), non-anginal pain (2), asymptomatic (1) | 1–4 |
| 18 | Diagnosis of CAD | Positive (1), negative (0) | 0, 1 |
NB: mmHg, millimeters of mercury; mg/dL, milligrams per deciliter; kg/m², kilograms per square meter; Bpm, beats per minute
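The ranges in the feature table can double as a basic sanity check on incoming patient records. The sketch below is one possible way to do that in Python: the numeric bounds are copied from the table, but the dictionary keys, the function name and the example record are hypothetical and were not part of the study. The diagnosis column is left out because it is the prediction target.

```python
# Numeric ranges copied from the feature table; the keys are hypothetical field names.
NUMERIC_RANGES = {
    "age": (1, 150),
    "blood_pressure": (90, 190),
    "glucose": (37, 295),
    "cholesterol": (128, 575),
    "triglyceride": (40, 690),
    "hdl": (10.6, 73),
    "ldl": (10, 220),
    "creatinine": (0.6, 3.3),
    "bmi": (20.28, 40.25),
    "heart_rate": (42, 124),
}
BINARY_FEATURES = (
    "sex", "family_history", "smoking", "diabetes", "hypertension", "hyperlipidemia",
)
CHEST_PAIN_CODES = (1, 2, 3, 4)  # asymptomatic (1) to typical angina (4)


def validate_record(record):
    """Return a list of problems found; an empty list means the record passes."""
    problems = []
    for name in BINARY_FEATURES:
        if record.get(name) not in (0, 1):
            problems.append(f"{name}: expected 0 or 1, got {record.get(name)!r}")
    if record.get("chest_pain") not in CHEST_PAIN_CODES:
        problems.append(f"chest_pain: expected 1-4, got {record.get('chest_pain')!r}")
    for name, (low, high) in NUMERIC_RANGES.items():
        value = record.get(name)
        if not isinstance(value, (int, float)) or not (low <= value <= high):
            problems.append(f"{name}: expected {low}-{high}, got {value!r}")
    return problems


# Made-up record for illustration only (not a row from the dataset).
example = {
    "age": 54, "sex": 1, "family_history": 0, "smoking": 1, "diabetes": 0,
    "hypertension": 1, "hyperlipidemia": 0, "blood_pressure": 140, "glucose": 110,
    "cholesterol": 210, "triglyceride": 150, "hdl": 45, "ldl": 120,
    "creatinine": 1.1, "bmi": 27.5, "heart_rate": 78, "chest_pain": 3,
}
out_of_range = dict(example, glucose=400)  # glucose above the tabulated maximum
print(validate_record(example))
print(validate_record(out_of_range))
```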
Fig. 2 Data type description of the dataset features
Fig. 3 Sample of the dataset
Fig. 4 Profile information of the dataset
Fig. 5 Frequency of age of the CAD patients
Fig. 6 Frequency of CAD family history of the patients
Fig. 7 Frequency of the body mass of the patients
Fig. 8 Frequency of the CAD diagnosis of the patients
r value of the correlation coefficient analysis
| SN | Dependent feature | Independent feature | r value | Correlation coefficient relationship |
|---|---|---|---|---|
| 1 | Age | Medical diagnostic result | 0.42 | Moderate uphill positive correlation coefficient relationship |
| 2 | Sex | Medical diagnostic result | 0.50 | Moderate uphill positive correlation coefficient relationship |
| 3 | Family history | Medical diagnostic result | 0.48 | Moderate uphill positive correlation coefficient relationship |
| 4 | Smoking | Medical diagnostic result | 0.24 | Weak uphill positive correlation coefficient relationship |
| 5 | Chest pain | Medical diagnostic result | 0.58 | Moderate uphill positive correlation coefficient relationship |
| 6 | Diabetes | Medical diagnostic result | 0.61 | Strong uphill positive correlation coefficient relationship |
| 7 | Glucose | Medical diagnostic result | 0.55 | Moderate uphill positive correlation coefficient relationship |
| 8 | Hypertension | Medical diagnostic result | 0.65 | Strong uphill positive correlation coefficient relationship |
| 9 | Blood pressure | Medical diagnostic result | 0.53 | Moderate uphill positive correlation coefficient relationship |
| 10 | Cholesterol | Medical diagnostic result | 0.44 | Moderate uphill positive correlation coefficient relationship |
| 11 | Hyperlipidemia | Medical diagnostic result | −0.50 | Moderate downhill negative correlation coefficient relationship |
| 12 | HDL | Medical diagnostic result | −0.20 | Weak downhill negative correlation coefficient relationship |
| 13 | Triglyceride | Medical diagnostic result | 0.28 | Weak uphill positive correlation coefficient relationship |
| 14 | LDL | Medical diagnostic result | 0.35 | Moderate uphill positive correlation coefficient relationship |
| 15 | Creatinine | Medical diagnostic result | 0.40 | Moderate uphill positive correlation coefficient relationship |
| 16 | Body mass | Medical diagnostic result | 0.50 | Moderate uphill positive correlation coefficient relationship |
| 17 | Heart rate | Medical diagnostic result | 0.53 | Moderate uphill positive correlation coefficient relationship |
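The r values in the table are Pearson correlation coefficients between each feature and the diagnostic result. A minimal sketch of the computation and of the verbal labelling is given below. The cut-offs (weak below 0.3, strong from 0.6) are an assumption inferred from how the tabulated values are labelled, not thresholds stated by the authors, and negative coefficients are described here as "downhill". The function names are illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def describe(r, weak_below=0.3, strong_from=0.6):
    """Label r by strength and direction; the cut-offs are assumed, see the text above."""
    size = abs(r)
    if size < weak_below:
        strength = "Weak"
    elif size < strong_from:
        strength = "Moderate"
    else:
        strength = "Strong"
    direction = "uphill positive" if r >= 0 else "downhill negative"
    return f"{strength} {direction}"


# Perfectly linear made-up data, for illustration only.
r_demo = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
print(round(r_demo, 2), describe(r_demo))

# Labels for three r values taken from the table.
for r in (0.65, 0.58, -0.20):
    print(r, describe(r))
```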
Fig. 9 The correlation coefficient analysis matrix of the dataset features
Performance evaluation result of the models
| S/N | Machine learning model | Accuracy (%) | Specificity (%) | Sensitivity (%) | ROC (%) |
|---|---|---|---|---|---|
| 1 | Logistic regression | 80.68 | 81.20 | 83.22 | 80.68 |
| 2 | Support vector machine | 88.68 | 86.34 | 87.34 | 88.63 |
| 3 | K-nearest neighbor | 82.35 | 83.76 | 84.30 | 82.95 |
| 4 | Random forest | 92.04 | 83.34 | 86.50 | 92.20 |
| 5 | Naive Bayes | 87.50 | 92.40 | 83.30 | 77.43 |
| 6 | Gradient boosting | 90.90 | 91.12 | 87.20 | 90.28 |
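The ranking reported in the abstract can be read directly off this table. The short Python sketch below copies the tabulated percentages into a dictionary and picks the top model for each measure; the variable and function names are illustrative.

```python
# Percentages copied from the performance evaluation table.
RESULTS = {
    "Logistic regression":    {"accuracy": 80.68, "specificity": 81.20, "sensitivity": 83.22, "roc": 80.68},
    "Support vector machine": {"accuracy": 88.68, "specificity": 86.34, "sensitivity": 87.34, "roc": 88.63},
    "K-nearest neighbor":     {"accuracy": 82.35, "specificity": 83.76, "sensitivity": 84.30, "roc": 82.95},
    "Random forest":          {"accuracy": 92.04, "specificity": 83.34, "sensitivity": 86.50, "roc": 92.20},
    "Naive Bayes":            {"accuracy": 87.50, "specificity": 92.40, "sensitivity": 83.30, "roc": 77.43},
    "Gradient boosting":      {"accuracy": 90.90, "specificity": 91.12, "sensitivity": 87.20, "roc": 90.28},
}


def best_by(metric):
    """Name of the model with the highest value for the given metric."""
    return max(RESULTS, key=lambda model: RESULTS[model][metric])


best = {metric: best_by(metric) for metric in ("accuracy", "specificity", "sensitivity", "roc")}
for metric, model in best.items():
    print(f"{metric}: {model} ({RESULTS[model][metric]:.2f}%)")
```

No single model leads on every measure, so the choice of a model for deployment depends on whether missed cases (sensitivity) or false alarms (specificity) matter more.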
Fig. 11 Models performance evaluation result
Fig. 10 Decision tree generated with random tree
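The abstract notes that the generated decision tree can be converted into production rules for an expert system. The sketch below illustrates the idea in Python by walking a small tree from the root to each leaf and emitting one IF-THEN rule per path. The tree here is a hypothetical toy: its features, split thresholds and structure are arbitrary placeholders and are not taken from Fig. 10 or from the study.

```python
# Hypothetical toy tree for illustration only; the splits and thresholds are
# arbitrary placeholders and are NOT the tree shown in Fig. 10.
TOY_TREE = {
    "feature": "hypertension", "threshold": 0.5,
    "left": {"leaf": "CAD negative"},
    "right": {
        "feature": "glucose", "threshold": 126.0,
        "left": {"leaf": "CAD negative"},
        "right": {"leaf": "CAD positive"},
    },
}


def tree_to_rules(node, conditions=()):
    """Return one 'IF ... THEN ...' production rule per root-to-leaf path."""
    if "leaf" in node:
        premise = " AND ".join(conditions) if conditions else "TRUE"
        return [f"IF {premise} THEN {node['leaf']}"]
    feature, threshold = node["feature"], node["threshold"]
    # The left branch holds records at or below the threshold, the right branch the rest.
    rules = tree_to_rules(node["left"], conditions + (f"{feature} <= {threshold}",))
    rules += tree_to_rules(node["right"], conditions + (f"{feature} > {threshold}",))
    return rules


rules = tree_to_rules(TOY_TREE)
for rule in rules:
    print(rule)
```

Each rule is the conjunction of the tests along one path, so the rule set covers every record exactly once, which is the property an expert system's rule base would rely on.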