João Albuquerque1,2,3, Ana Margarida Medeiros3,4, Ana Catarina Alves3,4, Mafalda Bourbon3,4, Marília Antunes2,5.
Abstract
Familial Hypercholesterolemia (FH) is an inherited disorder of cholesterol metabolism. Current criteria for FH diagnosis, such as the Simon Broome (SB) criteria, lead to high false positive rates. The aim of this work was to explore alternative classification procedures for FH diagnosis, based on different biological and biochemical indicators. For this purpose, logistic regression (LR), naive Bayes classifier (NB), random forest (RF) and extreme gradient boosting (XGB) algorithms were combined with the Synthetic Minority Oversampling Technique (SMOTE), or with threshold adjustment by maximizing the Youden index (YI), and compared. The models were tested through a 10 × 10 repeated k-fold cross-validation design. The LR model presented an overall better performance, as assessed by the areas under the receiver operating characteristics (AUROC) and precision-recall (AUPRC) curves, and by several operating characteristics (OC), regardless of the strategy used to cope with class imbalance. When adopting either data processing technique, significantly higher accuracy (Acc), G-mean and F1 score values were found for all classification algorithms, compared to the SB criteria (p < 0.01), revealing a more balanced predictive ability for both classes and higher effectiveness in classifying FH patients. Adjustment of the cut-off values through pre- or post-processing methods revealed a considerable gain in sensitivity (Sens) values (p < 0.01). Although the performance of the pre- and post-processing strategies was similar, SMOTE does not cause the model's parameters to lose interpretability. These results suggest that an LR model combined with SMOTE can be an optimal approach to be used as a widespread screening tool.
Year: 2022 PMID: 35749402 PMCID: PMC9231719 DOI: 10.1371/journal.pone.0269713
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.752
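The post-processing strategy named in the abstract, threshold adjustment by maximizing the Youden index, can be illustrated with a minimal sketch. It assumes the usual definition J = Sens + Spec - 1 and that a patient is classified as FH when the predicted probability is at or above the cut-off. The function name `youden_threshold` and the toy scores are hypothetical and are not taken from the study.

```python
def youden_threshold(scores, labels):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    scores: predicted probabilities of the positive class (here, FH).
    labels: 1 for positive, 0 for negative.
    A case is classified positive when score >= threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    best_t, best_j = None, -1.0
    # Every observed score is a candidate cut-off; ties keep the lowest one.
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# Toy example with made-up scores (not data from the study).
toy_scores = [0.10, 0.20, 0.35, 0.40, 0.60, 0.80]
toy_labels = [0, 0, 1, 0, 1, 1]
cutoff, j_stat = youden_threshold(toy_scores, toy_labels)
```

In a cross-validation setting the cut-off would normally be chosen on the training folds only, so that the held-out fold does not influence it; whether the study did this is not stated in the abstract.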
Confusion matrix for a binary outcome (adapted from Fawcett [30]).
| Actual class | Predicted positive | Predicted negative |
|---|---|---|
| Positive | True positive (TP) | False negative (FN) |
| Negative | False positive (FP) | True negative (TN) |
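The operating characteristics reported later in this record all derive from these four counts. The sketch below uses the standard textbook definitions, with G-mean taken as the geometric mean of sensitivity and specificity and F1 as the harmonic mean of PPV and sensitivity. That the study used exactly these definitions is an assumption, and the counts in the example are invented.

```python
from math import sqrt


def operating_characteristics(tp, fn, fp, tn):
    """Compute common operating characteristics from confusion-matrix counts.

    Assumes every denominator is non-zero (both classes present and at
    least one case predicted in each class).
    """
    sens = tp / (tp + fn)          # sensitivity (recall)
    spec = tn / (tn + fp)          # specificity
    ppv = tp / (tp + fp)           # positive predictive value (precision)
    npv = tn / (tn + fn)           # negative predictive value
    acc = (tp + tn) / (tp + fn + fp + tn)
    return {
        "Acc": acc,
        "Sens": sens,
        "Spec": spec,
        "PPV": ppv,
        "NPV": npv,
        "G-mean": sqrt(sens * spec),
        "F1": 2 * ppv * sens / (ppv + sens),
    }


# Invented counts, for illustration only.
oc = operating_characteristics(tp=40, fn=10, fp=20, tn=80)
```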
Comparison of biological and biochemical values between FH and non-FH patients, according to statins usage.
| Variable | Medicated: FH | Medicated: non-FH | Medicated: p-value | Non-medicated: FH | Non-medicated: non-FH | Non-medicated: p-value |
|---|---|---|---|---|---|---|
| n (%) | 111 (33.2) | 223 (66.8) | - | 35 (30.0) | 82 (70.0) | - |
| Gene: n (%) | ||||||
| LDLR | 104 (93.7) | - | - | 31 (88.6) | - | - |
| APOB | 5 (4.5) | - | - | 3 (8.6) | - | - |
| PCSK9 | 2 (1.8) | - | - | 1 (2.8) | - | - |
| Male: n (%) | 47 (42.3) | 107 (48.0) | 0.39 | 12 (34.3) | 40 (48.8) | 0.21 |
| Age: mean (sd) | 47.3 (14.8) | 48.2 (13.0) | 0.55 | 33.7 (12.2) | 39.7 (10.8) | <0.01 |
| BMI: mean (sd) | 25.9 (4.2) | 26.3 (3.9) | 0.16 | 25.5 (4.7) | 23.9 (3.4) | 0.15 |
| Physical signs: n (%) | 27 (24.3) | 19 (8.5) | <0.01 | 5 (14.3) | 5 (6.1) | 0.28 |
| CVD disease: n (%) | 32 (28.8) | 74 (33.2) | 0.50 | 6 (17.1) | 9 (11.0) | 0.54 |
| Age CVD: mean (sd) | 45.9 (11.8) | 47.2 (9.8) | 0.64 | 42.5 (10.3) | 34.1 (6.7) | 0.22 |
| Hypertension: n (%) | 39 (35.1) | 62 (27.8) | 0.62 | 4 (11.4) | 6 (7.3) | 0.71 |
| Smoking: n (%) | 16 (14.4) | 49 (22.0) | 0.13 | 5 (14.3) | 23 (28.0) | 0.17 |
| Cigarettes/day: mean (sd) | 11.8 (9.8) | 13.5 (7.9) | 0.29 | 14.0 (8.0) | 12.5 (8.2) | 0.72 |
| Alcohol use: n (%) | 16 (14.4) | 49 (22.0) | 0.99 | 4 (11.4) | 24 (29.3) | 0.07 |
| Alcohol units/week: mean (sd) | 10.3 (10.4) | 10.0 (8.7) | 0.67 | 2.5 (3.0) | 6.4 (6.8) | 0.14 |
| Lipid profile (in mg/dL) | ||||||
| TC: mean (sd) | 254.0 (58.0) | 209.0 (46.0) | <0.01 | 335.0 (75.0) | 279.0 (45.0) | <0.01 |
| LDLc: mean (sd) | 176.2 (52.6) | 127.5 (41.0) | <0.01 | 256.7 (70.3) | 195.2 (41.1) | <0.01 |
| HDLc: mean (sd) | 55.2 (15.9) | 56.7 (16.1) | 0.31 | 52.7 (17.5) | 57.1 (17.5) | 0.17 |
| TG: mean (sd) | 116.3 (57.1) | 141.5 (74.5) | <0.01 | 123.9 (51.1) | 141.2 (61.6) | 0.23 |
| Lp(a): mean (sd) | 59.0 (56.2) | 59.7 (63.5) | 0.74 | 42.2 (60.7) | 42.7 (52.7) | 0.61 |
| ApoAI: mean (sd) | 151.0 (36.0) | 161.0 (35.0) | <0.01 | 146.0 (39.0) | 162.0 (40.0) | 0.05 |
| ApoB: mean (sd) | 132.4 (44.9) | 99.5 (31.4) | <0.01 | 179.9 (38.3) | 136.4 (36.1) | <0.01 |
FH: familial hypercholesterolemia; BMI: body mass index; CVD: cardiovascular disease; TC: total cholesterol; LDLc: low density lipoprotein cholesterol; HDLc: high density lipoprotein cholesterol; TG: triglycerides; Lp(a): lipoprotein(a); Apo: apolipoprotein.
Area under the ROC and PR curves for medicated patients, using the original and SMOTE sample data.
| Model | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | | | | | | | | |
| LR | 0.84 | 0.84 | 0.84 | 0.84 | 0.84 | 0.84 | 0.84 | 0.84 | 0.84 | 0.84 |
| NB | 0.80 | 0.81 | 0.82 | 0.81 | 0.81 | 0.81 | 0.81 | 0.82 | 0.81 | 0.80 |
| RF | 0.81 | 0.82 | 0.82 | 0.82 | 0.82 | 0.81 | 0.82 | 0.82 | 0.82 | 0.82 |
| XGB | 0.81 | 0.83 | 0.83 | 0.82 | 0.83 | 0.81 | 0.82 | 0.83 | 0.82 | 0.82 |
| | | | | | | | | | | |
| LR | 0.84 | 0.84 | 0.84 | 0.84 | 0.84 | 0.84 | 0.83 | 0.84 | 0.84 | 0.84 |
| NB | 0.80 | 0.82 | 0.81 | 0.81 | 0.81 | 0.80 | 0.80 | 0.81 | 0.81 | 0.80 |
| RF | 0.82 | 0.82 | 0.83 | 0.82 | 0.82 | 0.81 | 0.80 | 0.82 | 0.82 | 0.82 |
| XGB | 0.80 | 0.82 | 0.81 | 0.82 | 0.81 | 0.81 | 0.82 | 0.83 | 0.82 | 0.82 |
| | | | | | | | | | | |
| LR | 0.70 | 0.71 | 0.71 | 0.71 | 0.71 | 0.71 | 0.70 | 0.71 | 0.70 | 0.70 |
| NB | 0.65 | 0.67 | 0.68 | 0.66 | 0.66 | 0.66 | 0.66 | 0.67 | 0.66 | 0.65 |
| RF | 0.70 | 0.72 | 0.71 | 0.70 | 0.71 | 0.69 | 0.69 | 0.69 | 0.71 | 0.70 |
| XGB | 0.68 | 0.72 | 0.71 | 0.70 | 0.71 | 0.68 | 0.69 | 0.70 | 0.69 | 0.69 |
| | | | | | | | | | | |
| LR | 0.70 | 0.70 | 0.71 | 0.71 | 0.72 | 0.71 | 0.70 | 0.71 | 0.71 | 0.71 |
| NB | 0.65 | 0.68 | 0.67 | 0.66 | 0.66 | 0.66 | 0.66 | 0.67 | 0.66 | 0.66 |
| RF | 0.70 | 0.71 | 0.70 | 0.71 | 0.70 | 0.68 | 0.68 | 0.70 | 0.72 | 0.70 |
| XGB | 0.66 | 0.70 | 0.69 | 0.71 | 0.67 | 0.68 | 0.68 | 0.69 | 0.69 | 0.69 |
AUROC: area under the receiver operating characteristics curve; AUPRC: area under the precision-recall curve; LR: logistic regression; NB: naive Bayes; RF: random forest; XGB: extreme gradient boosting; SMOTE: synthetic minority oversampling technique. Each replica of the dataset is denoted by m, and M denotes the combination of the estimates obtained for the observations in every replica.
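The replicas m in this table relate to the 10 × 10 repeated k-fold design described in the abstract. A minimal sketch of how such splits can be generated follows. It assumes that each replica corresponds to one fresh random 10-fold partition of the data, and it does not stratify by class, which the study may or may not have done. The function name `repeated_kfold` is hypothetical, and n = 334 is the medicated cohort size (111 FH + 223 non-FH) from the comparison table.

```python
import random


def repeated_kfold(n, k=10, repeats=10, seed=0):
    """Yield (replica, fold, train_idx, test_idx) for repeats x k-fold CV."""
    rng = random.Random(seed)
    for m in range(1, repeats + 1):
        idx = list(range(n))
        rng.shuffle(idx)                       # new random partition per replica m
        folds = [idx[i::k] for i in range(k)]  # k near-equal, disjoint folds
        for f, test in enumerate(folds, start=1):
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            yield m, f, train, test


# 10 replicas x 10 folds = 100 train/test splits of the 334 medicated patients.
splits = list(repeated_kfold(n=334))
```

Within one replica every observation is held out exactly once, so each replica yields one out-of-fold prediction per patient, from which a per-replica AUROC or AUPRC can be computed.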
Fig 1. Comparison of areas under the ROC and PR curves, for each replica of the dataset.
The left column shows the results using the original sample data, and the right column the results using SMOTE sample data. AUROC: area under the receiver operating characteristics curve; AUPRC: area under the precision-recall curve; LR: logistic regression; NB: naive Bayes; RF: random forest; XGB: extreme gradient boosting; SMOTE: synthetic minority oversampling technique.
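The SMOTE sample data compared in this figure are obtained by adding synthetic minority-class cases to the original sample. The sketch below shows the core interpolation step of SMOTE in plain Python: a synthetic point is placed at a random position on the segment between a minority case and one of its nearest minority neighbours. The function name `smote`, the choice k = 5, and the toy points are assumptions for illustration; the settings used in the study are not given here.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points from a list of minority-class points.

    minority: list of numeric feature vectors (at least two points).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to x.
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from x to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


# Toy minority class: the four corners of the unit square.
corners = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(corners, n_new=10, k=2, seed=1)
```

Oversampling is usually applied to the training folds only, so that synthetic cases never appear in the data used for evaluation.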
Mean and standard deviation values of operating characteristics (OC), for different classification algorithms and techniques to cope with data imbalance, and values obtained with SB criteria.
| Model | Acc | Sens | Spec | PPV | NPV | G-mean | F1 |
|---|---|---|---|---|---|---|---|
| | | | | | | | |
| LR | 0.79 (0.06) | 0.57 (0.12) | 0.89 (0.06) | 0.72 (0.14) | 0.81 (0.07) | 0.71 (0.08) | 0.63 (0.10) |
| NB | 0.79 (0.06) | 0.51 (0.13) | 0.92 (0.04) | 0.75 (0.15) | 0.80 (0.07) | 0.68 (0.09) | 0.60 (0.12) |
| RF | 0.79 (0.06) | 0.57 (0.13) | 0.89 (0.06) | 0.72 (0.13) | 0.81 (0.07) | 0.71 (0.08) | 0.63 (0.10) |
| XGB | 0.78 (0.06) | 0.57 (0.14) | 0.89 (0.05) | 0.71 (0.14) | 0.81 (0.07) | 0.71 (0.09) | 0.62 (0.12) |
| | | | | | | | |
| LR | 0.75 (0.06) | 0.81 (0.12) | 0.72 (0.09) | 0.58 (0.10) | 0.89 (0.07) | 0.76 (0.07) | 0.67 (0.09) |
| NB | 0.70 (0.07) | 0.75 (0.15) | 0.67 (0.12) | 0.53 (0.12) | 0.85 (0.08) | 0.70 (0.08) | 0.60 (0.11) |
| RF | 0.75 (0.07) | 0.69 (0.13) | 0.78 (0.09) | 0.60 (0.13) | 0.84 (0.07) | 0.73 (0.08) | 0.64 (0.10) |
| XGB | 0.77 (0.06) | 0.71 (0.13) | 0.81 (0.07) | 0.64 (0.13) | 0.85 (0.07) | 0.75 (0.08) | 0.66 (0.10) |
| | | | | | | | |
| LR | 0.76 (0.06) | 0.76 (0.11) | 0.76 (0.08) | 0.61 (0.11) | 0.87 (0.07) | 0.76 (0.07) | 0.67 (0.09) |
| NB | 0.73 (0.06) | 0.71 (0.12) | 0.75 (0.08) | 0.58 (0.13) | 0.84 (0.07) | 0.72 (0.07) | 0.63 (0.10) |
| RF | 0.76 (0.07) | 0.67 (0.13) | 0.80 (0.07) | 0.61 (0.13) | 0.83 (0.07) | 0.73 (0.09) | 0.63 (0.11) |
| XGB | 0.75 (0.07) | 0.73 (0.13) | 0.77 (0.07) | 0.60 (0.12) | 0.85 (0.07) | 0.75 (0.08) | 0.65 (0.11) |
| SB | 0.47 | 0.91 | 0.26 | 0.37 | 0.86 | 0.48 | 0.53 |
Acc: accuracy; Sens: sensitivity; Spec: specificity; PPV: positive predictive value; NPV: negative predictive value; SMOTE: synthetic minority oversampling technique; LR: logistic regression; NB: naive Bayes; RF: random forest; XGB: extreme gradient boosting; SB: Simon Broome criteria.
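The last two values in each row of this table are consistent with the G-mean and F1 score named in the abstract, under their usual definitions. As a quick check, the sketch below recomputes both from the sensitivity, specificity and PPV reported for the SB criteria. The definitions are the standard ones and are assumed, not quoted from the paper. Small differences in the second decimal are expected, because the inputs are themselves rounded.

```python
from math import sqrt


def g_mean(sens, spec):
    """Geometric mean of sensitivity and specificity."""
    return sqrt(sens * spec)


def f1_score(ppv, sens):
    """Harmonic mean of positive predictive value and sensitivity."""
    return 2 * ppv * sens / (ppv + sens)


# Values reported for the Simon Broome (SB) criteria in the table above.
sb = {"Sens": 0.91, "Spec": 0.26, "PPV": 0.37}
sb_g_mean = g_mean(sb["Sens"], sb["Spec"])  # about 0.486; the table reports 0.48
sb_f1 = f1_score(sb["PPV"], sb["Sens"])     # about 0.526; the table reports 0.53
```

The low G-mean of the SB criteria reflects the imbalance between their high sensitivity and low specificity, which is the pattern the abstract describes as a high false positive rate.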
Significant differences for operating characteristics (OC) values among several classification methods.
| Model | Acc | Sens | Spec | PPV | NPV | G-mean | F1 |
|---|---|---|---|---|---|---|---|
| | | | | | | | |
| LR-NB | 0.04↑ | - | - | - | - | 0.04↑ | 0.04↑ |
| LR-RF | - | 0.02↑ | - | - | - | - | - |
| LR-XGB | - | 0.04↑ | 0.02↓ | - | - | - | - |
| NB-RF | 0.01↓ | - | 0.01↓* | 0.02↓ | - | - | - |
| NB-XGB | <0.01↓* | - | <0.01↓* | <0.01↓* | - | 0.01↓ | 0.04↓ |
| | | | | | | | |
| LR-RF | 0.03↑ | - | 0.04↑ | - | - | - | - |
| NB-RF | - | - | 0.04↓ | - | - | - | - |

| Model | Acc | Sens | Spec | PPV | NPV | G-mean | F1 |
|---|---|---|---|---|---|---|---|
| | | | | | | | |
| SB-LR | <0.01↓* | <0.02↑* | <0.01↓* | <0.01↓* | - | <0.01↓* | 0.01↓* |
| SB-NB | <0.01↓* | <0.01↑* | <0.01↓* | <0.01↓* | 0.01↑ | <0.01↓* | - |
| SB-RF | <0.01↓* | <0.01↑* | <0.01↓* | <0.01↓* | - | <0.01↓* | 0.01↓* |
| SB-XGB | <0.01↓* | <0.01↑* | <0.01↓* | <0.01↓* | - | <0.01↓* | 0.02↓ |
| | | | | | | | |
| SB-LR | <0.01↓* | 0.01↑ | <0.01↓* | <0.01↓* | - | <0.01↓* | <0.01↓* |
| SB-NB | <0.01↓* | <0.01↑* | <0.01↓* | <0.01↓* | - | <0.01↓* | 0.04↓ |
| SB-RF/XGB | <0.01↓* | <0.01↑* | <0.01↓* | <0.01↓* | - | <0.01↓* | <0.01↓* |
| | | | | | | | |
| SB-LR/XGB | <0.01↓* | <0.01↑* | <0.01↓* | <0.01↓* | - | <0.01↓* | <0.01↓* |
| SB-NB/RF | <0.01↓* | <0.01↑* | <0.01↓* | <0.01↓* | - | <0.01↓* | 0.01↓* |

| Model | Acc | Sens | Spec | PPV | NPV | G-mean | F1 |
|---|---|---|---|---|---|---|---|
| | | | | | | | |
| 0.5— | - | <0.01↓* | <0.01↑* | <0.01↑* | <0.01↓* | - | - |
| 0.5— | - | <0.01↓* | <0.01↑* | <0.01↑* | <0.01↓* | - | - |
| | - | - | 0.03↓ | - | - | - | - |
| | | | | | | | |
| 0.5— | <0.01↑* | <0.01↓* | <0.01↑* | <0.01↑* | <0.01↓* | - | - |
| 0.5— | 0.02↑ | <0.01↓* | <0.01↑* | <0.01↑* | <0.01↓* | - | - |
| | 0.04↑ | - | <0.01↓* | 0.04↓ | - | - | - |
| | | | | | | | |
| 0.5— | - | <0.01↓* | <0.01↑* | <0.01↑* | 0.01↓* | - | - |
| 0.5— | - | <0.01↓* | <0.01↑* | <0.01↑* | - | - | - |
| | | | | | | | |
| 0.5— | - | <0.01↓* | <0.01↑* | 0.01↑* | 0.01↓* | 0.04↓ | - |
| 0.5— | - | <0.01↓* | <0.01↑* | <0.01↑* | 0.01↓* | - | - |
| | - | - | 0.04↑ | - | - | - | - |
Significant differences are signalled with an ↑ or ↓, depending on whether the first model performs better or worse than the second. If *, differences are still significant after applying Bonferroni correction. If -, the difference is not significant at p < 0.05. Acc: accuracy; Sens: sensitivity; Spec: specificity; PPV: positive predictive value; NPV: negative predictive value; SMOTE: synthetic minority oversampling technique; YI: Youden Index; LR: logistic regression; NB: naive Bayes; RF: random forest; XGB: extreme gradient boosting; SB: Simon Broome criteria. Non-reported pairwise comparisons did not present any significant difference.
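The two significance levels described in this note can be sketched as follows. With a Bonferroni correction, a p-value is compared against alpha divided by the number of comparisons in the family, rather than against alpha itself. How many comparisons the authors included in each family is not stated here, so the family in the example is hypothetical, as are the function name `bonferroni_flags` and the p-values.

```python
def bonferroni_flags(p_values, alpha=0.05):
    """For each p-value return (significant, significant_after_bonferroni)."""
    m = len(p_values)       # number of comparisons in the family
    corrected = alpha / m   # Bonferroni-adjusted significance level
    return [(p < alpha, p < corrected) for p in p_values]


# Hypothetical family of five pairwise comparisons: corrected level is 0.01.
example_p = [0.04, 0.02, 0.001, 0.20, 0.008]
flags = bonferroni_flags(example_p)
```

In this example the first two comparisons would be reported with an arrow only, the third and fifth with an arrow and an asterisk, and the fourth with a dash.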
Fig 2. Comparison of operating characteristics values between different classification algorithms, and strategies to deal with data imbalance.
The dashed line represents the value obtained when applying SB criteria. Acc: accuracy; Sens: sensitivity; Spec: specificity; PPV: positive predictive value; NPV: negative predictive value; SMT: synthetic minority oversampling technique; YI: Youden Index; LR: logistic regression; NB: naive Bayes; RF: random forest; XGB: extreme gradient boosting; SB: Simon Broome criteria.