Hamid R Marateb, Mohammad Reza Mohebian, Shaghayegh Haghjooy Javanmard, Amir Ali Tavallaei, Mohammad Hasan Tajadini, Motahar Heidari-Beni, Miguel Angel Mañanas, Mohammad Esmaeil Motlagh, Ramin Heshmat, Marjan Mansourian, Roya Kelishadi.
Abstract
Dyslipidemia, a disorder of lipoprotein metabolism resulting in a high lipid profile, is an important modifiable risk factor for coronary heart disease. It is associated with more than four million deaths worldwide per year. Half of the children with dyslipidemia have hyperlipidemia during adulthood, so its prediction and screening are critical. We designed a new dyslipidemia diagnosis system. A sample of 725 subjects (age 14.66 ± 2.61 years; 48% male; dyslipidemia prevalence of 42%) was selected by multistage random cluster sampling in Iran. Single nucleotide polymorphisms (rs1801177, rs708272, rs320, rs328, rs2066718, rs2230808, rs5880, rs5128, rs2893157, rs662799, and Apolipoprotein-E2/E3/E4), anthropometric and lifestyle attributes, and family history of diseases were analyzed. A framework for classifying mixed-type data in imbalanced datasets was proposed. It included internal feature mapping and selection, re-sampling, an optimized group method of data handling using convex and stochastic optimizations, a new cost function for imbalanced data, and internal validation. Its performance was assessed using hold-out and 4-fold cross-validation. Four other classifiers, namely support vector machines, decision tree, multilayer perceptron neural network, and multiple logistic regression, were also used. The average sensitivity, specificity, accuracy and precision of the proposed system were 93%, 94%, 94% and 92%, respectively, in cross-validation. It significantly outperformed the other classifiers and also showed excellent agreement and high correlation with the gold standard. A non-invasive, economical version of the algorithm, suitable for low- and middle-income countries, was also implemented. It is thus a promising new tool for the prediction of dyslipidemia.
Keywords: Computer-assisted diagnosis; Deep learning; Dyslipidemia; Genomics; Health promotion; Machine learning
Year: 2018 PMID: 30026888 PMCID: PMC6050175 DOI: 10.1016/j.csbj.2018.02.009
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Supplementary material S2
The flowchart of the proposed framework for classifying mixed-type data in imbalanced datasets.
The classification performance measures used in our study.
| Kappa = agreement rate |
True positive (TP): subjects with dyslipidemia, correctly identified; false positive (FP): subjects without dyslipidemia, incorrectly identified; true negative (TN): subjects without dyslipidemia, correctly identified; false negative (FN): subjects with dyslipidemia, incorrectly identified; Se: sensitivity; Rl: recall; Sp: specificity; FA: false alarm; Acc: accuracy; Pr: precision; F1S: F1-Score; AUC: area under the receiver operating characteristic (ROC) curve; LR: likelihood ratio; DOR: diagnostic odds ratio; MCC: Matthews correlation coefficient; DP: discriminant power; Kappa: Cohen's kappa coefficient, defined as the agreement rate between the predicted class labels and the gold standard.
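As a minimal sketch of how several of the measures listed above follow from the four confusion-matrix counts, the function below uses the standard textbook definitions. The function name and the example counts are hypothetical and chosen only for illustration; they are not the study's data, and the source does not show the authors' own formulas.

```python
from math import sqrt

def classification_measures(tp, fp, tn, fn):
    """Standard confusion-matrix measures (assumes no zero denominators)."""
    n = tp + fp + tn + fn
    se = tp / (tp + fn)          # sensitivity (recall)
    sp = tn / (tn + fp)          # specificity
    pr = tp / (tp + fp)          # precision
    acc = (tp + tn) / n          # accuracy = observed agreement
    # Agreement expected by chance, from the row and column totals
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return {
        "Se": se,
        "Sp": sp,
        "FA": 1 - sp,                                  # false alarm rate
        "Acc": acc,
        "Pr": pr,
        "F1S": 2 * pr * se / (pr + se),
        "LR+": se / (1 - sp),                          # positive likelihood ratio
        "LR-": (1 - se) / sp,                          # negative likelihood ratio
        "MCC": (tp * tn - fp * fn)
               / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
        "Kappa": (acc - pe) / (1 - pe),                # Cohen's kappa
    }

# Hypothetical counts: 100 subjects with and 100 without dyslipidemia
m = classification_measures(tp=85, fp=9, tn=91, fn=15)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

The sketch does not guard against empty cells (for example, no predicted positives), which a production version would need to handle.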
Characteristics of the participants in the dyslipidemia and normal groups.
| Predictors | Categories | Dyslipidemia⁎: No | Dyslipidemia⁎: Yes | OR [CI 95%] | P-value |
|---|---|---|---|---|---|
| Age (years) | | 14.28 ± 2.26 | 14.64 ± 2.39 | – | 0.058 |
| Sex | Male | 49.28 | 46.61 | 0.90 [0.67,1.21] | 0.477 |
| | Female | | | – | |
| Region | Urban | 64.80 | 71.71 | 1.38 [1.01,1.89] | 0.049 |
| | Rural | | | – | |
| Family history of diabetes | No | 70.54 | 66.14 | – | 0.207 |
| | Yes | | | 1.23 [0.89,1.68] | |
| Family history of obesity | No | 68.32 | 70.12 | – | 0.604 |
| | Yes | | | 0.92 [0.67,1.27] | |
| Family history of cancer | No | 83.23 | 78.88 | – | 0.137 |
| | Yes | | | 1.33 [0.91,1.93] | |
| Family history of CVD | No | 87.16 | 92.43 | – | 0.023 |
| | Yes | | | 0.55 [0.33,0.93] | |
| Abdominal obesity | No | 88.41 | 61.59 | – | <0.001 |
| | Yes | | | 4.76 [3.26,6.94] | |
| BMI category (WHO criteria) | Under weight | 25.85 | 19.52 | 0.76 [0.52,1.09] | 0.007 |
| | Normal | 58.22 | 58.17 | – | |
| | Over weight | 8.36 | 10.76 | 1.29 [0.77,2.15] | |
| | Obese | 7.57 | 11.55 | 1.53 [0.91,2.56] | |
| Physical activity | Mild | 25.47 | 45.82 | 2.03 [1.43,2.87] | <0.001 |
| | Moderate | 40.37 | 35.86 | – | |
| | High | 34.16 | 18.32 | 0.60 [0.40,0.89] | |
| Birth weight | Low | 11.67 | 16.73 | 1.54 [1.0,2.34] | 0.249 |
| | Normal | 79.58 | 74.10 | – | |
| | High | 8.75 | 9.17 | 1.13 [0.67,1.89] | |
| Systolic blood pressure (mm Hg) | | 101.87 ± 13.16 | 104.16 ± 13.09 | – | 0.025 |
| Diastolic blood pressure (mm Hg) | | 65.89 ± 10.74 | 66.69 ± 10.61 | – | 0.338 |
| Fast blood sugar (mg/dL) | | 87.6 ± 11.85 | 84.32 ± 11.85 | – | 0.002 |
| HDL-C (mg/dL) | | 59.95 ± 18.22 | 29.40 ± 12.37 | – | <0.001 |
| LDL-C (mg/dL) | | 75.43 ± 28.35 | 92.55 ± 38.09 | – | <0.001 |
| Total cholesterol (mg/dL) | | 149.66 ± 29.50 | 154.46 ± 30.20 | – | 0.061 |
| Triglyceride (mg/dL) | | 86.06 ± 33.08 | 93.35 ± 34.35 | – | <0.001 |
*: Results are reported as mean ± standard deviation (for interval variables) and percentage (for categorical variables). CVD: cardiovascular disease; BMI: body mass index; WHO: World Health Organization; HDL-C: high-density lipoprotein cholesterol; LDL-C: low-density lipoprotein cholesterol; OR: odds ratio (one categorical level was set as the reference for each categorical variable); CI: confidence interval. For binary variables, the frequency percentage of only one of the two categories is shown in each dyslipidemia group.
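The odds ratios in this table can be illustrated with a short sketch. The point estimate needs only the group percentages, so the abdominal obesity row can be reproduced directly. The confidence interval needs raw counts, which the table does not give, so the counts below are hypothetical. The Woolf (log-scale Wald) interval is an assumption made for illustration; the source does not state which interval method the authors used.

```python
from math import exp, log, sqrt

def odds_ratio_from_percent(pct_case, pct_control):
    """Crude OR from the percentage exposed in each group."""
    odds_case = pct_case / (100 - pct_case)
    odds_control = pct_control / (100 - pct_control)
    return odds_case / odds_control

def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """OR and Woolf 95% CI from a 2x2 table.

    a: exposed cases, b: unexposed cases,
    c: exposed controls, d: unexposed controls.
    """
    or_ = (a * d) / (b * c)
    se_log = sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # SE of ln(OR)
    return or_, exp(log(or_) - z * se_log), exp(log(or_) + z * se_log)

# Abdominal obesity "Yes": 100 - 61.59 = 38.41% (dyslipidemia),
# 100 - 88.41 = 11.59% (normal), taken from the table above.
or_abdominal = odds_ratio_from_percent(38.41, 11.59)
print(round(or_abdominal, 2))   # matches the reported 4.76

# Hypothetical counts, for the interval only
or_h, lo_h, hi_h = odds_ratio_with_ci(a=40, b=60, c=10, d=90)
print(f"OR = {or_h:.2f} [{lo_h:.2f}, {hi_h:.2f}]")
```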
SNP genotype and allele frequencies (in percentage) of the participants in the dyslipidemia and normal groups.
| Polymorphism | Genotype and allele⁎ | Dyslipidemia: No | Dyslipidemia: Yes | OR [CI 95%] | P-value |
|---|---|---|---|---|---|
| LPL D9N [rs1801177] | AA | 96.4 | 91.2 | – | 0.003 |
| | AG | | | 2.59 [1.35–4.96] | |
| ABCA1 V771M [rs2066718] | GG | 94.0 | 98.7 | – | 0.002 |
| | GA | | | 0.21 [0.07–0.60] | |
| LPL HindIII [rs320] | GG | 24.4 | 50.8 | – | <0.001 |
| | GT | 48.6 | 42.0 | 0.31 [0.23–0.43] | |
| | TT | 27.0 | 7.2 | | |
| LPL S447X [rs328] | CC | 72.7 | 88.6 | – | <0.001 |
| | CG | 24.6 | 10.4 | 0.34 [0.23–0.52] | |
| | GG | 2.6 | 1.0 | | |
| ABCA1 R1587K [rs2230808] | AA | 66.7 | 47.6 | – | <0.001 |
| | AG | 29.9 | 39.4 | 2.21 [1.64–3.00] | |
| | GG | 3.3 | 13.0 | | |
| CETP TaqIB [rs708272] | CC | 19.1 | 60.6 | – | <0.001 |
| | CT | 61.7 | 35.5 | 0.15 [0.11–0.22] | |
| | TT | 19.1 | 3.9 | | |
| APOC3 SstI [rs5128] | CC | 83.0 | 83.7 | – | 0.371 |
| | CG | 16.7 | 15.3 | 0.95 [0.64–1.41] | |
| | GG | 0.2 | 1.0 | | |
| CETP A373P [rs5880] | CC | 93.5 | 77.9 | – | <0.001 |
| | CG | 6.5 | 20.8 | 4.12 [2.56–6.62] | |
| | GG | 0.0 | 1.3 | | |
| APOA1 MspI [rs2893157] | GG | 69.4 | 74.3 | – | 0.119 |
| | GA | 27.8 | 24.8 | 0.79 [0.56–1.09] | |
| | AA | 2.9 | 1.0 | | |
| APOA5 C-1131T [rs662799] | CC | 98.8 | 97.7 | – | 0.525 |
| | CT | 0.5 | 1.0 | 1.93 [0.61–6.13] | |
| | TT | 0.7 | 1.3 | | |
| ApoE | e2 | 6.9 | 0.7 | 1.73 [1.08–2.76] | <0.001 |
| | e4 | 1.7 | 13.4 | | |
| | e3 | 91.4 | 86.0 | – | |
*: The genotypes GG (SNP rs1801177) and CC (SNP rs2066718) had zero frequency in both the normal and dyslipidemia groups and are thus not shown in the results. OR: odds ratio (one categorical level was set as the reference for each categorical variable); CI: confidence interval. For binary variables, the frequency percentage of only one of the two categories is shown in each dyslipidemia group.
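For the three-genotype SNPs, the table reports a single odds ratio per polymorphism. The sketch below shows one reading that reproduces several of the reported values: pooling the two non-reference genotypes and comparing them with the reference genotype. This is an observation drawn from the table's numbers, not a statement of the authors' method, and the function name is hypothetical.

```python
def carrier_odds_ratio(genotypes, reference):
    """OR of pooled non-reference genotypes versus the reference genotype.

    genotypes maps genotype -> (pct in normal group, pct in dyslipidemia group).
    """
    ref_no, ref_yes = genotypes[reference]
    # Pool every genotype other than the reference in each group
    car_no = sum(no for g, (no, _) in genotypes.items() if g != reference)
    car_yes = sum(yes for g, (_, yes) in genotypes.items() if g != reference)
    return (car_yes / ref_yes) / (car_no / ref_no)

# Percentages copied from the table above
rs320 = {"GG": (24.4, 50.8), "GT": (48.6, 42.0), "TT": (27.0, 7.2)}
rs708272 = {"CC": (19.1, 60.6), "CT": (61.7, 35.5), "TT": (19.1, 3.9)}
rs2230808 = {"AA": (66.7, 47.6), "AG": (29.9, 39.4), "GG": (3.3, 13.0)}

or_rs320 = carrier_odds_ratio(rs320, "GG")
or_rs708272 = carrier_odds_ratio(rs708272, "CC")
or_rs2230808 = carrier_odds_ratio(rs2230808, "AA")
print(round(or_rs320, 2), round(or_rs708272, 2), round(or_rs2230808, 2))
```

Because the inputs are percentages rounded to one decimal place, small deviations from the reported values are expected for some rows (for example, rs5880 gives about 4.08 against the reported 4.12).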
The hold-out (50%) validation of the classifiers.
| Feature subset | Classifier | Se | Sp | Acc | F1S | Pr | FA | AUC | MCC | DOR | DP | Kappa |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Proposed | 85 | 91 | 88 | 86 | 87 | 0.09 | 0.88 | 0.76 | 57 | 1.0 | 0.76 |
| | DT | 69 | 80 | 75 | 70 | 72 | 0.20 | 0.75 | 0.47 | 9 | 0.5 | 0.46 |
| | MLP | 67 | 88 | 79 | 73 | 80 | 0.12 | 0.78 | 0.56 | 15 | 0.6 | 0.56 |
| | MLR | 61 | 86 | 75 | 68 | 76 | 0.14 | 0.74 | 0.49 | 10 | 0.5 | 0.49 |
| | SVM | 71 | 78 | 75 | 70 | 70 | 0.22 | 0.75 | 0.45 | 9 | 0.5 | 0.44 |
| 2 | Proposed | 93 | 95 | 94 | 93 | 93 | 0.05 | 0.94 | 0.87 | 252 | 1.3 | 0.87 |
| | DT | 71 | 81 | 77 | 72 | 73 | 0.19 | 0.76 | 0.50 | 10 | 0.6 | 0.50 |
| | MLP | 70 | 86 | 79 | 74 | 79 | 0.14 | 0.78 | 0.57 | 14 | 0.6 | 0.57 |
| | MLR | 59 | 87 | 75 | 67 | 77 | 0.13 | 0.73 | 0.48 | 10 | 0.5 | 0.47 |
| | SVM | 71 | 82 | 77 | 72 | 74 | 0.18 | 0.77 | 0.52 | 11 | 0.6 | 0.52 |
| 3 | Proposed | 82 | 84 | 83 | 80 | 79 | 0.16 | 0.83 | 0.64 | 24 | 0.8 | 0.64 |
| | DT | 48 | 68 | 60 | 50 | 52 | 0.32 | 0.58 | 0.12 | 2 | 0.2 | 0.10 |
| | MLP | 17 | 93 | 61 | 27 | 64 | 0.07 | 0.55 | 0.16 | 3 | 0.2 | 0.13 |
| | MLR | 17 | 94 | 61 | 27 | 68 | 0.06 | 0.56 | 0.18 | 3 | 0.3 | 0.14 |
| | SVM | 61 | 68 | 65 | 59 | 58 | 0.32 | 0.65 | 0.17 | 3 | 0.3 | 0.12 |
Set 1 included sex, the analyzed SNPs, and family history of diseases: sex, LPL D9N [rs1801177], ABCA1 V771M [rs2066718], LPL HindIII [rs320], LPL S447X [rs328], ABCA1 R1587K [rs2230808], CETP TaqIB [rs708272], APOC3 SstI [rs5128], CETP A373P [rs5880], APOA1 MspI [rs2893157], APOA5 C-1131T [rs662799], ApoE, and family history of diabetes, obesity, cancer, and CVD. Set 2 included Set 1 plus birth weight, age, and physical activity. Set 3 included sex, age, physical activity, birth weight, BMI category, abdominal obesity, and family history of diabetes, obesity, cancer, and CVD. All classifiers were trained on the same training set and validated on the same test set; the results shown are those on the test set. DT: decision tree; MLP: multilayer perceptron neural network; MLR: multiple logistic regression; SVM: support vector machine.
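Several columns of the hold-out table (FA, AUC, DOR, DP) can be cross-checked from Se and Sp alone. The sketch below does this for the three "Proposed" rows. It makes two assumptions that happen to reproduce the reported values but are not stated in the source: AUC is taken as the single-operating-point value (Se + Sp)/2, and discriminant power uses base-10 logarithms. The function name is hypothetical.

```python
from math import log10, pi, sqrt

def derived_from_se_sp(se_pct, sp_pct):
    """Derive FA, AUC, DOR and DP from sensitivity and specificity (in %)."""
    se, sp = se_pct / 100, sp_pct / 100
    odds_se = se / (1 - se)
    odds_sp = sp / (1 - sp)
    return {
        "FA": round(1 - sp, 2),                 # false alarm rate
        "AUC": round((se + sp) / 2, 2),         # assumption: one-point ROC area
        "DOR": round(odds_se * odds_sp),        # diagnostic odds ratio
        # assumption: base-10 logs, which match the reported DP values
        "DP": round(sqrt(3) / pi * (log10(odds_se) + log10(odds_sp)), 1),
    }

# (Se, Sp) and the reported FA, AUC, DOR, DP for the "Proposed" rows above
reported = {
    1: ((85, 91), {"FA": 0.09, "AUC": 0.88, "DOR": 57, "DP": 1.0}),
    2: ((93, 95), {"FA": 0.05, "AUC": 0.94, "DOR": 252, "DP": 1.3}),
    3: ((82, 84), {"FA": 0.16, "AUC": 0.83, "DOR": 24, "DP": 0.8}),
}

checks = {s: derived_from_se_sp(*se_sp) == row
          for s, (se_sp, row) in reported.items()}
print(checks)
```

Since Se and Sp are themselves rounded to whole percentages in the table, rows for the other classifiers may differ from the recomputed values in the last digit.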
Non-significant (P-value > 0.05).
The four-fold cross-validation results of the proposed prediction system (mean ± SD).
| Feature subset | Se | Sp | Acc | Pr |
|---|---|---|---|---|
| 1 | 87 ± 2 | 90 ± 1 | 89 ± 1 | 86 ± 1 |
| 2 | 93 ± 2 | 94 ± 1 | 94 ± 1 | 92 ± 1 |
| 3 | 83 ± 2 | 84 ± 2 | 84 ± 1 | 79 ± 2 |
Se: sensitivity; Sp: specificity; Acc: accuracy; Pr: precision.
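A minimal sketch of how a four-fold summary of this kind can be produced: subjects are split into four folds that preserve the class ratio, each fold serves once as the test set, and the per-fold measure is summarized as mean ± SD. The stratified split, the class counts (roughly 42% of 725), the per-fold sensitivities, and the use of the sample standard deviation are all assumptions made for illustration; the source does not describe how the folds were formed.

```python
import random
import statistics

def stratified_folds(labels, k=4, seed=0):
    """Assign sample indices to k folds, keeping class proportions similar."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)      # deal indices round-robin
    return folds

def mean_sd(values):
    """Mean and sample standard deviation across folds."""
    return statistics.mean(values), statistics.stdev(values)

# Hypothetical labels: 304 with dyslipidemia (1), 421 without (0)
labels = [1] * 304 + [0] * 421
folds = stratified_folds(labels, k=4)
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
print([len(f) for f in folds], positives_per_fold)

# Hypothetical per-fold sensitivities (%), not the study's values
fold_se = [91, 95, 93, 93]
m_se, sd_se = mean_sd(fold_se)
print(f"Se = {m_se:.0f} ± {sd_se:.0f}")
```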