Sangwoo Lee, Eun Kyung Choe, Boram Park.
Abstract
BACKGROUND: Machine learning (ML) is a promising methodology for classification and prediction applications in healthcare. However, this method has not been practically established for clinical data. Hyperuricemia is a biomarker of various chronic diseases. We aimed to predict uric acid status from basic healthcare checkup test results using several ML algorithms and to evaluate their performance.
Keywords: machine learning; prediction; uric acid
Year: 2019 PMID: 30717373 PMCID: PMC6406925 DOI: 10.3390/jcm8020172
Source DB: PubMed Journal: J Clin Med ISSN: 2077-0383 Impact factor: 4.241
Compared machine learning algorithms.
| No. | Machine Learning Scheme | Method in Detail | Data Splitting Method |
|---|---|---|---|
| 1 | Discriminant analysis classification (DAC) | K-fold cross validation with k = 5 | Training set ratio = 0.7, test set ratio = 0.3 |
| 2 | k-nearest neighbor classification (KNNC) | K-fold cross validation with k = 5 | Training set ratio = 0.7, test set ratio = 0.3 |
| 3 | Naïve Bayes classification (NBC) | K-fold cross validation with k = 5 | Training set ratio = 0.7, test set ratio = 0.3 |
| 4 | Support vector machine classification (SVMC) | K-fold cross validation with k = 5 | Training set ratio = 0.7, test set ratio = 0.3 |
| 5 | Decision tree classification (DTC) | K-fold cross validation with k = 5 | Training set ratio = 0.7, test set ratio = 0.3 |
| 6 | Random forest classification (RFC) | K-fold cross validation with k = 5 | Training set ratio = 0.7, test set ratio = 0.3 |
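
The splitting scheme in the table can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' code: the helper names `train_test_split_indices` and `k_fold_indices`, the fixed seed, the unstratified shuffling, and the choice to run the 5-fold cross validation inside the 70% training portion are all assumptions made for the example.

```python
import random

def train_test_split_indices(n_samples, train_ratio=0.7, seed=0):
    """Shuffle sample indices and split them into training and test portions."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(round(n_samples * train_ratio))
    return indices[:n_train], indices[n_train:]

def k_fold_indices(indices, k=5):
    """Yield (fit, validation) index lists for k-fold cross validation."""
    # Every k-th index goes to the same fold, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validation

# Hypothetical sample size, chosen only to make the proportions easy to read.
train_idx, test_idx = train_test_split_indices(1000, train_ratio=0.7)
cv_folds = list(k_fold_indices(train_idx, k=5))
```

Each of the five folds serves once as the validation part while the remaining four are used for fitting; the 30% test portion is never touched during cross validation.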
Performance measures and their definitions.
| Notation | Description | Upper Bound |
|---|---|---|
| Accuracy | (TP + TN)/(TP + FN + FP + TN) | 1 when FN = 0 and FP = 0 |
| Sensitivity (Recall, True positive rate) | TP/(TP + FN) | 1 when FN = 0 |
| Specificity (True negative rate) | TN/(FP + TN) | 1 when FP = 0 |
| Precision | TP/(TP + FP) | 1 when FP = 0 |
| Balanced classification rate | (SN × SP)^(1/2) | 1 when SN = 1 and SP = 1 |
| F1-score | (2 × SN × Precision)/(SN + Precision) | 1 when SN = 1 and Precision = 1 |
TP: true positive; TN: true negative; FP: false positive; FN: false negative; SN: sensitivity; and SP: specificity.
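
The definitions above translate directly into code. The following sketch computes all six measures from the four confusion-matrix counts; the function name `classification_metrics` and the example counts are hypothetical, and the function assumes every denominator is non-zero.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Compute the six tabulated measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # recall, true positive rate
    specificity = tn / (fp + tn)   # true negative rate
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        # Balanced classification rate: geometric mean of SN and SP.
        "bcr": math.sqrt(sensitivity * specificity),
        # F1-score: harmonic mean of SN and precision.
        "f1": 2 * sensitivity * precision / (sensitivity + precision),
    }

# Hypothetical counts, not taken from the study.
metrics = classification_metrics(tp=60, tn=70, fp=30, fn=40)
```

Because the BCR is a geometric mean, it drops toward zero whenever either sensitivity or specificity is low, which accuracy alone does not show.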
Demographic features of the included population.
| Variable | Normal Uric Acid Level | Hyperuricemia | p-Value |
|---|---|---|---|
| Sex | | | |
| Male | 19,540 (64.5%) | 6764 (87.8%) | <0.001 |
| Female | 10,756 (35.5%) | 941 (12.2%) | |
| Age | 52.1 ± 9.4 | 50.7 ± 9.6 | <0.001 |
| Systolic blood pressure | 116.6 ± 13.9 | 120.0 ± 13.3 | <0.001 |
| Diastolic blood pressure | 75.6 ± 10.8 | 79.2 ± 10.7 | <0.001 |
| Height (cm) | 166.2 ± 8.0 | 169.3 ± 7.1 | <0.001 |
| Weight (kg) | 65.2 ± 11.0 | 72.4 ± 10.9 | <0.001 |
| Body mass index (kg/m²) | 23.5 ± 2.8 | 25.2 ± 2.9 | <0.001 |
| Waist circumference | 84.7 ± 7.9 | 89.4 ± 7.7 | <0.001 |
| White blood cell count (cells/mL) | 5.4 ± 1.5 | 5.9 ± 1.7 | <0.001 |
| Hemoglobin (g/dL) | 14.4 ± 1.5 | 15.1 ± 1.3 | <0.001 |
| Glucose (mg/dL) | 97.6 ± 19.5 | 99.0 ± 18.2 | <0.001 |
| Total cholesterol (mg/dL) | 193.1 ± 34.2 | 200.8 ± 36.0 | <0.001 |
| GOT (IU/L) | 24.4 ± 14.8 | 28.5 ± 16.7 | <0.001 |
| GPT (IU/L) | 25.8 ± 24.6 | 33.9 ± 24.9 | <0.001 |
| GGT (IU/L) | 36.0 ± 42.7 | 55.3 ± 63.8 | <0.001 |
| Creatinine (mg/dL) | 0.9 ± 0.2 | 1.0 ± 0.2 | <0.001 |
| Triglyceride (mg/dL) | 108.0 ± 69.9 | 144.8 ± 95.6 | <0.001 |
| HDL cholesterol (mg/dL) | 53.3 ± 12.6 | 49.3 ± 11.1 | <0.001 |
| LDL cholesterol (mg/dL) | 121.8 ± 28.9 | 129.4 ± 31.1 | <0.001 |
| Urine albumin, Positive | 363 (1.2%) | 203 (2.6%) | <0.001 |
| Smoking | | | <0.001 |
| None | 14,274 (47.1%) | 2198 (28.5%) | |
| Ex-smoker | 9891 (32.6%) | 3375 (43.8%) | |
| Current smoker | 6131 (20.2%) | 2132 (27.7%) | |
| Alcohol, Heavy | 16,236 (53.6%) | 5298 (68.8%) | <0.001 |
| Diabetes, Yes | 2311 (7.6%) | 508 (6.6%) | 0.002 |
| Hypertension, Yes | 6003 (19.8%) | 2169 (28.2%) | <0.001 |
| Dyslipidemia, Yes | 4765 (15.7%) | 1531 (19.9%) | <0.001 |
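
The sex rows give complete counts for both groups, so the group sizes and the class balance can be recovered by simple arithmetic. The sketch below does this under the assumption that every participant appears in exactly one sex row. The majority-class baseline is an illustration added here, not a figure reported in the source; it is the accuracy a classifier would reach by labelling everyone as normal.

```python
# Counts copied from the sex rows of the demographics table.
counts = {
    "male": {"normal": 19540, "hyperuricemia": 6764},
    "female": {"normal": 10756, "hyperuricemia": 941},
}

n_normal = sum(group["normal"] for group in counts.values())
n_hyper = sum(group["hyperuricemia"] for group in counts.values())
n_total = n_normal + n_hyper

# Share of participants with hyperuricemia, overall and by sex.
prevalence = n_hyper / n_total
prevalence_by_sex = {
    sex: group["hyperuricemia"] / (group["normal"] + group["hyperuricemia"])
    for sex, group in counts.items()
}

# Accuracy of always predicting the larger class (normal uric acid level).
majority_baseline_accuracy = max(n_normal, n_hyper) / n_total
```

With roughly one participant in five in the hyperuricemia group, accuracy on its own says little about how well the minority class is detected, which is why sensitivity and BCR are worth reading alongside it.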
Comparison of model performance for maximum sensitivity criterion and maximum BCR criterion.
| Model | Training Accuracy | Training SN | Training SP | Training BCR | Training Precision | Training F1 Score | Test Accuracy | Test SN | Test SP | Test BCR | Test Precision | Test F1 Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| For maximum sensitivity criterion | ||||||||||||
| DAC | 0.70 | 0.58 | 0.73 | 0.65 | 0.35 | 0.44 | 0.70 | 0.59 | 0.73 | 0.65 | 0.37 | 0.45 |
| KNNC | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.72 | 0.34 | 0.82 | 0.53 | 0.34 | 0.34 |
| NBC | 0.62 | 0.73 | 0.60 | 0.66 | 0.31 | 0.44 | 0.63 | 0.73 | 0.60 | 0.66 | 0.33 | 0.45 |
| SVMC | 0.53 | 0.48 | 0.54 | 0.51 | 0.21 | 0.29 | 0.52 | 0.48 | 0.54 | 0.51 | 0.22 | 0.30 |
| DTC | 0.80 | 0.10 | 0.97 | 0.31 | 0.52 | 0.17 | 0.78 | 0.08 | 0.97 | 0.28 | 0.49 | 0.14 |
| RFC | 0.78 | 0.88 | 0.75 | 0.81 | 0.47 | 0.61 | 0.68 | 0.66 | 0.69 | 0.67 | 0.36 | 0.47 |
| For maximum BCR criterion | ||||||||||||
| DAC | 0.70 | 0.58 | 0.73 | 0.65 | 0.35 | 0.44 | 0.70 | 0.59 | 0.73 | 0.65 | 0.37 | 0.45 |
| KNNC | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.72 | 0.34 | 0.82 | 0.53 | 0.34 | 0.34 |
| NBC | 0.62 | 0.73 | 0.60 | 0.66 | 0.31 | 0.44 | 0.63 | 0.73 | 0.60 | 0.66 | 0.33 | 0.45 |
| SVMC | 0.53 | 0.48 | 0.54 | 0.51 | 0.21 | 0.29 | 0.52 | 0.48 | 0.54 | 0.51 | 0.22 | 0.30 |
| DTC | 0.80 | 0.10 | 0.97 | 0.31 | 0.52 | 0.17 | 0.78 | 0.08 | 0.97 | 0.28 | 0.49 | 0.14 |
| RFC | 0.73 | 0.71 | 0.73 | 0.72 | 0.40 | 0.51 | 0.70 | 0.64 | 0.71 | 0.68 | 0.37 | 0.47 |
SN: sensitivity; SP: specificity; BCR: balanced classification rate; DAC: discriminant analysis classification; KNNC: K-nearest neighbor classification; NBC: naïve Bayes classification; SVMC: support vector machine classification; DTC: decision tree classification; and RFC: random forest classification.
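
As a consistency check, the BCR and F1 columns can be recomputed from the tabulated sensitivity, specificity, and precision. The sketch below does this for the test-set values under the maximum sensitivity criterion. Because the inputs are already rounded to two decimals, small gaps are expected; the helper name `recompute` is an assumption of this example.

```python
import math

# Test-set values (SN, SP, BCR, precision, F1), maximum sensitivity criterion.
test_rows = {
    "DAC": (0.59, 0.73, 0.65, 0.37, 0.45),
    "KNNC": (0.34, 0.82, 0.53, 0.34, 0.34),
    "NBC": (0.73, 0.60, 0.66, 0.33, 0.45),
    "SVMC": (0.48, 0.54, 0.51, 0.22, 0.30),
    "DTC": (0.08, 0.97, 0.28, 0.49, 0.14),
    "RFC": (0.66, 0.69, 0.67, 0.36, 0.47),
}

def recompute(sn, sp, precision):
    """Return (BCR, F1) derived from sensitivity, specificity, and precision."""
    bcr = math.sqrt(sn * sp)
    f1 = 2 * sn * precision / (sn + precision)
    return bcr, f1

# Absolute gap between recomputed and tabulated values, per model.
deviations = {}
for model, (sn, sp, bcr, precision, f1) in test_rows.items():
    bcr_hat, f1_hat = recompute(sn, sp, precision)
    deviations[model] = (abs(bcr_hat - bcr), abs(f1_hat - f1))
```

Every gap stays below 0.01, which is what rounding of the inputs alone would produce. The DTC row shows the effect of the geometric mean: a specificity of 0.97 cannot compensate for a sensitivity of 0.08.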
Performance comparison with conventional logistic regression model for total set (maximum sensitivity criterion).
| Model | AUC | 95% Confidence Interval | p-Value |
|---|---|---|---|
| CLR | 0.568 | 0.563–0.572 | Reference |
| NBC | 0.669 | 0.663–0.675 | <0.001 |
| RFC | 0.775 | 0.770–0.780 | <0.001 |
| DAC | 0.661 | 0.655–0.667 | <0.001 |
| KNNC | 0.8723 | 0.868–0.877 | <0.001 |
| SVMC | 0.515 | 0.509–0.522 | <0.001 |
| DTC | 0.537 | 0.534–0.541 | <0.001 |
CLR: conventional logistic regression; NBC: naïve Bayes classification; RFC: random forest classification; DAC: discriminant analysis classification; KNNC: K-nearest neighbor classification; SVMC: support vector machine classification; DTC: decision tree classification; and AUC: area under the curve.
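
For readers unfamiliar with the AUC, it can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below implements that pairwise definition directly. The source does not state how the AUC values, confidence intervals, or p-values were computed, so the function `roc_auc` and the toy labels and scores are illustrative assumptions only.

```python
def roc_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy data: 1 = hyperuricemia, 0 = normal; scores are hypothetical model outputs.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)  # 3 of 4 pairs ranked correctly -> 0.75
```

The pairwise loop is quadratic in the number of cases, so a rank-based formulation would be preferable for a cohort of this size. An AUC of 0.5 corresponds to random ranking, which puts the SVMC (0.515) and DTC (0.537) values in the table close to chance.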