| Literature DB >> 35891045 |
Elias Dritsas, Maria Trigka.
Abstract
Cholesterol is a waxy substance found in blood lipids. It supports the production of new cells as long as it stays at a healthy level. When cholesterol exceeds the permissible limits, it has the opposite effect, causing serious heart problems. When a person has high cholesterol (hypercholesterolemia), fats block the blood vessels, and thus circulation through the arteries becomes difficult. The heart does not receive the oxygen it needs, and the risk of heart attack increases. Nowadays, machine learning (ML) has gained special interest from physicians, medical centers and healthcare providers due to its key capabilities in health-related issues, such as risk prediction, prognosis, and the treatment and management of various conditions. In this article, a supervised ML methodology is outlined whose main objective is to create highly efficient risk prediction tools for the occurrence of hypercholesterolemia. Specifically, a data understanding analysis is conducted to explore the features' association with, and importance to, hypercholesterolemia. These factors are used to train and test several ML models in order to find the most efficient one for our purpose. For the evaluation of the ML models, the precision, recall, accuracy, F-measure, and AUC metrics were considered. The results highlighted Soft Voting with Rotation Forest and Random Forest as base models, which achieved better performance than the other models, with an AUC of 94.5%, precision of 92%, recall of 91.8%, F-measure of 91.7% and an accuracy equal to 91.75%.
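For reference, the evaluation metrics named in the abstract (precision, recall, accuracy, F-measure) follow the standard binary-classification definitions. The sketch below is a minimal, hypothetical illustration of those definitions, not the authors' code; the confusion counts are made up for the example.

```python
# Hypothetical sketch: the standard binary-classification metrics
# computed from the confusion counts of a classifier
# (tp = true positives, fp = false positives, fn = false negatives,
#  tn = true negatives).

def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Return precision, recall, accuracy, and F-measure."""
    precision = tp / (tp + fp)           # of predicted positives, how many were true
    recall = tp / (tp + fn)              # of actual positives, how many were found
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f_measure = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f_measure": f_measure}

# Illustrative counts only (not from the study's test set).
print(classification_metrics(tp=90, fp=8, fn=8, tn=94))
```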
Keywords: cholesterol; data analysis; hypercholesterolemia; long-term prediction; machine learning
Year: 2022 PMID: 35891045 PMCID: PMC9322993 DOI: 10.3390/s22145365
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.847
Statistical Description of the Numerical Features in the Balanced Dataset.
| Feature | Min | Max | Mean ± stdv |
|---|---|---|---|
| Age | 50 | 85 | 66.4 ± 9.5 |
| BMI | 18.3 | 53.1 | 28.61 ± 5.02 |
| Waist | 70 | 148.6 | 101.76 ± 13.18 |
| SysBP | 90 | 201 | 136.6 ± 20.5 |
| DiasBP | 13 | 108 | 70.27 ± 12.22 |
| HDL | 19 | 114 | 50.94 ± 16.42 |
| LDL | 51 | 328 | 157.6 ± 40.1 |
| TotChol | 75 | 360 | 208.49 ± 39.69 |
Figure 1Pearson correlation analysis.
Features’ order of importance in the balanced data.
| Feature | Pearson | Feature | Gain | Feature | InfoGain | Feature | Random |
|---|---|---|---|---|---|---|---|
| TotChol | 0.6777 | TotChol | 0.3061 | TotChol | 0.5633 | TotChol | 0.3790 |
| LDL | 0.6152 | LDL | 0.2171 | LDL | 0.3963 | LDL | 0.3165 |
| HDL | 0.1366 | DiasBP | 0.1142 | DiasBP | 0.0283 | DiasBP | 0.0788 |
| DiasBP | 0.1148 | Gender | 0.0085 | Gender | 0.0085 | Age | 0.0512 |
| BMI | 0.1106 | Alcohol | 0.0079 | Alcohol | 0.0079 | BMI | 0.0262 |
| Gender | 0.1038 | Hypertension | 0.0034 | Physical Activity | 0.0043 | Alcohol | 0.0242 |
| Alcohol | 0.1042 | Physical Activity | 0.0029 | Hypertension | 0.0034 | HDL | 0.0182 |
| Age | 0.0711 | Diabetes | 0.0027 | Diabetes | 0.0019 | SysBP | 0.0154 |
| Hypertension | 0.0681 | SysBP | 0 | SysBP | 0 | Waist | 0.0151 |
| Physical Activity | 0.0586 | HDL | 0 | HDL | 0 | Gender | 0.0145 |
| Diabetes | 0.0520 | BMI | 0 | BMI | 0 | Hypertension | 0.0124 |
| SysBP | 0.0502 | Waist | 0 | Waist | 0 | Diabetes | 0.0000 |
| Waist | 0.0192 | Age | 0 | Age | 0 | Physical Activity | −0.0021 |
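The Pearson column above ranks each feature by its linear correlation with the hypercholesterolemia label. A minimal sketch of that statistic, assuming a numerical feature vector and a 0/1 label (the data below are toy values, not the study's):

```python
# Hypothetical sketch: Pearson correlation between one numerical feature
# and a binary class label, as used for the feature ranking above.
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient between sequences x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy data: total cholesterol values vs. a 0/1 label (illustrative only).
tot_chol = [150, 180, 210, 250, 300]
label = [0, 0, 1, 1, 1]
print(round(pearson_r(tot_chol, label), 4))
```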
Figure 2Participants’ distribution per age group and gender type in the balanced dataset.
Figure 3Participants’ distribution in terms of BMI and waist categories in the balanced dataset.
Figure 4Participants’ distribution for both diabetes and hypertension in the balanced dataset.
Figure 5Participants’ distribution in terms of alcohol consumption in the balanced dataset.
Figure 6Participants’ distribution in terms of physical activity in the balanced dataset.
Figure 7Ensemble Learners: Soft Voting and Stacking.
Machine Learning Models’ Settings.
| Model | Parameters |
|---|---|
| NB | useKernelEstimator = false |
| LR | ridge = |
| LMT | LR models at leaves |
| DT | noPruning: false, MinVarianceProp = 0.001 |
| RotF (using J48) | confidence_factor: 0.25, unpruned: false |
| RF | max_depth = 0, numIterations = 100, numFeatures = 0 |
| ANN | hidden layers: ‘a’, learning rate: 0.3, momentum factor: 0.2 |
| SVM | kernel type: linear |
| K-NN | K = 3, 5 |
| Stacking | Base models: RF, RotF; meta-model: LR |
| Soft Voting | Base models: RF, RotF; average probabilities |
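The Soft Voting row states that the ensemble averages the base models' class probabilities. A minimal sketch of that combination rule, with made-up probability vectors standing in for the RF and RotF outputs:

```python
# Hypothetical sketch of the soft-voting rule: average the class-probability
# estimates of the base learners and predict the arg-max class.

def soft_vote(base_probs):
    """base_probs: list of per-model probability vectors over the classes."""
    n_models = len(base_probs)
    n_classes = len(base_probs[0])
    avg = [sum(p[c] for p in base_probs) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: avg[c]), avg

# Two base models disagree; the averaged probabilities decide.
pred, avg = soft_vote([[0.6, 0.4], [0.3, 0.7]])
print(pred, avg)  # class 1 wins with average [0.45, 0.55]
```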
Performance Evaluation of ML Models.
| Model | Accuracy | Precision | Recall | F-Measure | AUC |
|---|---|---|---|---|---|
|  | 87.37% | 0.877 | 0.874 | 0.873 | 0.931 |
|  | 88.40% | 0.884 | 0.884 | 0.884 | 0.884 |
|  | 87.63% | 0.876 | 0.876 | 0.876 | 0.927 |
|  | 82.73% | 0.828 | 0.827 | 0.827 | 0.912 |
|  | 70.62% | 0.707 | 0.706 | 0.706 | 0.758 |
|  | 90.98% | 0.911 | 0.910 | 0.910 | 0.939 |
|  | 86.85% | 0.869 | 0.869 | 0.869 | 0.928 |
|  | 89.69% | 0.900 | 0.897 | 0.897 | 0.943 |
|  | 88.92% | 0.892 | 0.889 | 0.889 | 0.902 |
| Stacking | 91.24% | 0.915 | 0.912 | 0.912 | 0.937 |
| Soft Voting | 91.75% | 0.920 | 0.918 | 0.917 | 0.945 |
Performance Comparison of ML Models.
| Model | Recall (proposed) | Recall (ref. [ ]) | Accuracy (proposed) | Accuracy (ref. [ ]) |
|---|---|---|---|---|
|  | 87.40% | 68.90% | 87.37% | 62.69% |
|  | 88.40% | 72.70% | 88.40% | 59.51% |
|  | 82.70% | 66.70% | 82.73% | 61.42% |
|  | 67.30% | 67.70% | 67.27% | 56.56% |
|  | 91% | 69.60% | 90.98% | 61.86% |
|  | 88.90% | 72.20% | 88.92% | 61.39% |
|  | 86.90% | 73.50% | 86.85% | 62.99% |
|  | 89.70% | 68.80% | 89.69% | 61.36% |