I-Jung Tsai1, Wen-Chi Shen2, Chia-Ling Lee3, Horng-Dar Wang2,4,5, Ching-Yu Lin1,6.
Abstract
The incidence of bladder cancer has been increasing globally. Urinary cytology is considered a major screening method for bladder cancer, but it has poor sensitivity. This study aimed to use clinical laboratory data and machine learning methods to build predictive models of bladder cancer. A total of 1336 patients with cystitis, bladder cancer, kidney cancer, uterus cancer, or prostate cancer were enrolled. Two-step feature selection, combining WEKA with forward selection, was performed. Five machine learning models were then applied: decision tree, random forest, support vector machine (SVM), extreme gradient boosting (XGBoost), and light gradient boosting machine (lightGBM). The selected features were calcium, alkaline phosphatase (ALP), albumin, urine ketone, urine occult blood, creatinine, alanine aminotransferase (ALT), and diabetes. The lightGBM model obtained an accuracy of 84.8% to 86.9%, a sensitivity of 84% to 87.8%, a specificity of 82.9% to 86.7%, and an area under the curve (AUC) of 0.88 to 0.92 in discriminating bladder cancer from cystitis and other cancers. Our study demonstrates the use of clinical laboratory data to predict bladder cancer.
Keywords: bladder cancer; clinical laboratory data; feature selection; machine learning
Year: 2022 PMID: 35054370 PMCID: PMC8774436 DOI: 10.3390/diagnostics12010203
Source DB: PubMed Journal: Diagnostics (Basel) ISSN: 2075-4418
The missing data rate in the clinical laboratory data.
| Feature | Data Type | Missing Data | Missing Data (%) |
|---|---|---|---|
| A/G Ratio | Continuous | 0.441 | 44.1 |
| Albumin | Continuous | 0.224 | 22.4 |
| ALP | Continuous | 0.378 | 37.8 |
| ALT | Continuous | 0.068 | 6.8 |
| AST | Continuous | 0.046 | 4.6 |
| BUN | Continuous | 0.018 | 1.8 |
| Calcium | Continuous | 0.428 | 42.8 |
| Chloride | Continuous | 0.084 | 8.4 |
| Creatinine | Continuous | 0.005 | 0.5 |
| Direct Bilirubin | Continuous | 0.303 | 30.3 |
| Estimated GFR | Continuous | 0.008 | 0.8 |
| Glucose AC | Continuous | 0.038 | 3.8 |
| Nitrite | Categorical | 0.026 | 2.6 |
| Urine occult Blood | Categorical | 0.026 | 2.6 |
| pH | Continuous | 0.026 | 2.6 |
| Potassium | Continuous | 0.035 | 3.5 |
| Sodium | Continuous | 0.037 | 3.7 |
| Specific Gravity | Continuous | 0.026 | 2.6 |
| Strip WBC | Continuous | 0.16 | 16 |
| Total Bilirubin | Continuous | 0.216 | 21.6 |
| Total Cholesterol | Continuous | 0.19 | 19 |
| Total Protein | Continuous | 0.285 | 28.5 |
| Triglyceride | Continuous | 0.204 | 20.4 |
| Urine epithelium (UL) | Continuous | 0.43 | 43 |
| Urine epithelium count | Continuous | 0.02 | 2 |
| Uric acid | Continuous | 0.15 | 15 |
| Urine Bilirubin | Categorical | 0.026 | 2.6 |
| Urine Glucose | Categorical | 0.16 | 16 |
| Urine Ketone | Categorical | 0.16 | 16 |
| Urine Protein | Categorical | 0.026 | 2.6 |
| Urobilinogen | Categorical | 0.026 | 2.6 |
A/G Ratio: albumin globulin ratio. BUN: blood urea nitrogen.
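The missing data rate in the table above is the fraction of patients with no recorded value for a feature. A minimal sketch of that calculation follows, assuming each patient is a dictionary in which a missing measurement is stored as `None` or is absent; the function name `missing_rates` and the toy records are hypothetical and are not data from the study.

```python
from typing import Dict, List, Optional


def missing_rates(rows: List[Dict[str, Optional[float]]]) -> Dict[str, float]:
    """Return the fraction of missing (None or absent) values per feature."""
    # Collect every feature name that appears in at least one record.
    features = sorted({name for row in rows for name in row})
    n = len(rows)
    return {
        name: sum(1 for row in rows if row.get(name) is None) / n
        for name in features
    }


# Hypothetical toy records, not data from the study.
records = [
    {"Albumin": 4.1, "Calcium": None},
    {"Albumin": None, "Calcium": 9.2},
    {"Albumin": 3.9, "Calcium": None},
    {"Albumin": 4.0, "Calcium": 9.0},
]
rates = missing_rates(records)
```

Multiplying each rate by 100 gives the percentage column of the table.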
The parameters of machine learning and feature selection.
| Algorithm Name | Parameter Name | Parameter Value |
|---|---|---|
| InfoGainAttributeEval | binarizeNumericAttributes | False |
| | doNotCheckCapabilities | False |
| | missingMerge | True |
| Ranker | generateRanking | True |
| | numToSelect | −1 |
| Decision tree | criterion of tree | gini |
| | depth of tree | 4 |
| Random forest | criterion of tree | gini |
| | estimators | 300 |
| SVM | kernel | rbf |
| | C value | 1000 |
| | gamma | 0.000001 |
| XGBoost | eta | 0.2 |
| | depth of tree | 7 |
| LightGBM | number of leaf | 100 |
| | depth of tree | 1 |
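The first two rows of the table refer to WEKA's InfoGainAttributeEval evaluator and Ranker search, which score each attribute by its information gain with respect to the class and sort the attributes by that score. The sketch below illustrates the idea for categorical features only. It is not WEKA's implementation: handling of numeric attributes and missing values is omitted, and the names `entropy`, `info_gain`, `rank_features`, and the toy data are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence, Tuple


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of a label sequence."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def info_gain(feature: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Information gain H(class) - H(class | feature) for a categorical feature."""
    n = len(labels)
    groups: Dict[Hashable, List[Hashable]] = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the class within each feature value.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(
    table: Dict[str, Sequence[Hashable]], labels: Sequence[Hashable]
) -> List[Tuple[str, float]]:
    """Sort feature names by descending information gain, keeping all of them."""
    gains = {name: info_gain(column, labels) for name, column in table.items()}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data, not data from the study.
labels = ["bladder", "bladder", "cystitis", "cystitis"]
table = {"occult_blood": [1, 1, 0, 0], "nitrite": [0, 1, 0, 1]}
ranking = rank_features(table, labels)
```

In this toy example `occult_blood` separates the two classes perfectly and is ranked first, while `nitrite` carries no information about the class.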
The confusion matrix for evaluation of model performance.
| | Patients: Bladder Cancer | Patients: Cystitis |
|---|---|---|
| Predicted bladder cancer | true positive, TP | false positive, FP |
| Predicted cystitis | false negative, FN | true negative, TN |
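The performance measures reported in the later tables (accuracy, precision, F1 score, sensitivity, and specificity) follow from the four cells of this confusion matrix by their standard definitions. A minimal sketch is given below; the function name `classification_metrics` and the example counts are hypothetical and are not results from the study.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Standard binary classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)      # predicted positives that are truly positive
    sensitivity = tp / (tp + fn)    # true positives that were detected (recall)
    specificity = tn / (tn + fp)    # true negatives that were detected
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts, not results from the study.
metrics = classification_metrics(tp=40, fp=10, fn=5, tn=45)
```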
Comparison of clinical characteristics and clinical laboratory data between patients with cystitis and patients with other cancers.
| | Cystitis | Kidney Cancer | Prostate Cancer | Bladder Cancer | Uterus Cancer |
|---|---|---|---|---|---|
| age | 60.12 ± 11.99 | 63.41 ± 10.45 ** | 71.83 ± 6.42 ** | 66.73 ± 9.4 ** | 60.86 ± 10.26 |
| sex | 88 (61.1%) | 138 (69%) | 201 (100%) ** | 386 (65.3%) | 0 ** |
| hypertension | 34 (23.7%) | 72 (36%) * | 66 (32.8%) | 173 (29.3%) | 22 (11%) * |
| diabetes | 20 (13.9%) | 46 (23%) * | 33 (16.4%) | 93 (15.7%) | 8 (4%) ** |
| smoking | 18 (12.5%) | 51 (25.5%) * | 47 (23.4%) * | 138 (23.4%) * | 9 (4.5%) * |
| drinking | 23 (16%) | 41 (20.5%) | 64 (31.8%) ** | 118 (20%) | 17 (8.5%) ** |
| betel nuts | 2 (1.4%) | 3 (1.5%) | 3 (1.5%) | 20 (3.4%) | 1 (0.5%) |
| family history | 1 (0.7%) | 7 (3.5%) | 5 (2.5%) | 12 (2%) | 7 (3.5%) |
| A/G Ratio | - | 1.75 ± 0.4 | 1.07 ± 0.46 | 1.61 ± 0.47 | 1.64 ± 0.38 |
| Albumin | 3.96 ± 0.64 | 4.27 ± 0.57 ** | 4 ± 0.68 | 4 ± 0.68 | 4.15 ± 0.58 * |
| ALP | 71 (55, 91) | 66 (52, 79) ** | 66 (60, 72) ** | 69 (55, 90) ** | 65.5 (53, 79) |
| ALT | 30.12 ± 42.66 | 31.46 ± 35.99 | 29.94 ± 32.49 | 27.45 ± 31.9 | 25.65 ± 28.02 |
| AST | 28.31 ± 32.07 | 30.56 ± 29.51 | 43.13 ± 74.83 * | 40.65 ± 293.66 | 29.33 ± 53.46 |
| BUN | 14 (11, 21) | 16 (12, 21) ** | 17 (13, 22.9) * | 16 (12, 26) ** | 12 (9, 16.85) * |
| Calcium | 9 (8.5, 9.4) | 9.3 (8.85, 9.6) | 8.895 (8.3, 9.4) ** | 9 (8.5, 9.4) ** | 9.2 (8.65, 9.65) ** |
| Chloride | 105.2 (103, 107.5) | 105 (103, 107) * | 105 (102.075, 107.225) | 105 (102, 108) * | 106 (104, 108) ** |
| Creatinine | 0.9 (0.7, 1.2) | 1.1 (0.9, 1.5) ** | 1 (0.9, 1.2) * | 1.1 (0.8, 1.5) ** | 0.7 (0.6, 0.8) ** |
| Direct Bilirubin | 0.11 (0.1, 0.2) | 0.1 (0.1, 0.2) ** | 0.2 (0.1, 0.2) | 0.1 (0.1, 0.2) * | 0.1 (0.1, 0.18) |
| Estimated GFR | 75.38 ± 33.22 | 62.68 ± 32.54 ** | 71.92 ± 27.16 | 64.13 ± 34 ** | 91.57 ± 28.86 ** |
| Glucose AC | 120.65 ± 47.39 | 123.11 ± 41.14 | 129.36 ± 56.6 | 124.83 ± 93.83 | 109.57 ± 26.12 * |
| pH | 6.24 ± 0.89 | 6.03 ± 0.74 * | 6.09 ± 0.85 | 6.26 ± 0.86 | 6.12 ± 0.71 |
| Potassium | 3.92 ± 0.5 | 4.14 ± 0.52 ** | 4.02 ± 0.48 | 4.12 ± 0.61 ** | 4.06 ± 0.46 ** |
| Sodium | 139 (137, 141) | 139 (137, 140.775) | 138 (136, 140) * | 139 (137, 140.8) | 139 (138.5, 141) ** |
| Specific Gravity | 1.016 (1.012, 1.021) | 1.017 (1.013, 1.021) | 1.017 (1.011, 1.022) | 1.014 (1.01, 1.0195) * | 1.016 (1.011, 1.021) |
| Total Bilirubin | 0.9 ± 0.49 | 0.84 ± 0.31 | 1.08 ± 1.25 | 1.04 ± 1.93 | 0.81 ± 0.68 |
| Total Cholesterol | 190.15 ± 49.57 | 182.2 ± 42.36 | 189.88 ± 45.34 | 183.76 ± 44.58 | 200.1 ± 50.31 |
| Total Protein | 6.6 (6.0625, 7) | 6.9 (6.55, 7.2) ** | 6.7 (6, 7.1) | 6.7 (6.1, 7.1325) ** | 6.9 (6.3, 7.3) |
| Triglyceride | 143.47 ± 88.07 | 141.88 ± 170.99 | 142.3 ± 89.64 | 131.97 ± 91.15 | 133.65 ± 141.94 |
| Uric acid | 5.82 ± 1.83 | 6.1 ± 1.64 | 6.36 ± 2.61 | 6.22 ± 1.88 * | 5.4 ± 1.76 |
| Urine epithelium (UL) | - | 0 (0, 6) | 2.5 (0, 5.75) | 3 (0, 9) | - |
| Urine epithelium count | 0 (0, 2) | 0 (0, 2) | 0 (0, 2) ** | 1 (0, 3) | 2 (0, 3) * |
| Nitrite | |||||
| 0 | 127 (88.2%) | 194 (97%) | 188 (93.5%) | 493 (83.4%) | 181 (90.5%) |
| 1 | 17 (11.8%) | 6 (3%) | 13 (6.5%) | 98 (16.6%) | 19 (9.5%) |
| Strip WBC | |||||
| 0 | 49 (34%) | 140 (70%) | 145 (72.1%) | 298 (50.4%) | 79 (39.5%) |
| 1 | 24 (16.9%) | 44 (22%) | 34 (17%) | 133 (62.1%) | 79 (39.5%) |
| 2 | 12 (8.3%) | 12 (6%) | 10 (5%) | 73 (12.4%) | 24 (12%) |
| 3 | 13 (9%) | 4 (2%) | 12 (6%) | 87 (14.7%) | 18 (9%) |
| Urine Bilirubin | |||||
| 0 | 137 (95.1%) | 190 (95%) | 190 (94.5%) | 549 (94.9%) | 194 (97%) |
| 1 | 4 (2.8%) | 10 (5%) | 11 (5.5%) | 28 (4.7%) | 4 (2%) |
| 2 | 2 (1.4%) | 0 | 0 | 7 (1.2%) | 2 (1%) |
| 3 | 1 (0.7%) | 0 | 0 | 7 (1.2%) | 0 |
| Urine Glucose | |||||
| 0 | 86 (59.7%) | 179 (89.5%) | 183 (91%) | 509 (86.1%) | 185 (92.5%) |
| 1 | 8 (5.6%) | 12 (6%) | 8 (4%) | 52 (8.8%) | 9 (4.5%) |
| 2 | 1 (0.7%) | 3 (1.5%) | 2 (1%) | 13 (2.2%) | 3 (1.5%) |
| 3 | 3 (2.1%) | 6 (3%) | 8 (4%) | 17 (2.9%) | 3 (1.5%) |
| Urine Ketone | |||||
| 0 | 86 (59.7%) | 184 (92%) | 176 (87.6%) | 523 (88.5%) | 156 (78%) |
| 1 | 10 (7%) | 14 (7%) | 24 (12%) | 58 (9.8%) | 38 (19%) |
| 2 | 0 | 0 | 0 | 6 (1%) | 5 (2.5%) |
| 3 | 2 (1.4%) | 2 (1%) | 1 (0.5%) | 4 (0.7%) | 1 (0.5%) |
| Urine Protein | |||||
| 0 | 75 (52.1%) | 124 (62%) | 118 (58.7%) | 279 (47.2%) | 156 (78%) |
| Trace | 16 (11.1%) | 25 (12.5%) | 25 (12.4%) | 50 (8.5%) | 15 (7.5%) |
| 1 | 15 (10.4%) | 19 (9.5%) | 27 (13.4%) | 91 (15.4%) | 9 (4.5%) |
| 2 | 24 (16.7%) | 19 (9.5%) | 23 (11.4%) | 97 (16.4%) | 12 (6%) |
| 3 | 14 (9.7%) | 13 (6.5%) | 8 (4%) | 74 (12.5%) | 8 (4%) |
| Urobilinogen | |||||
| 0 | 2 (1.4%) | 7 (3.5%) | 6 (3%) | 20 (3.4%) | 1 (0.5%) |
| 0.1 | 55 (38.2%) | 62 (31%) | 59 (29.4%) | 200 (33.8%) | 124 (62%) |
| 0.2 | 55 (38.2%) | 84 (42%) | 83 (41.3%) | 261 (44.2%) | 53 (26.5%) |
| 1 | 29 (20.1%) | 47 (23.5%) | 52 (25.9%) | 106 (17.9%) | 20 (10%) |
| 2 | 3 (2.1%) | 0 | 1 (0.5%) | 2 (0.3%) | 1 (0.5%) |
| 4 | 0 | 0 | 0 | 2 (0.3%) | 1 (0.5%) |
| Urine occult Blood | |||||
| 0 | 52 (36.1%) | 119 (59.5%) | 112 (55.7%) | 201 (34%) | 85 (42.5%) |
| Trace | 16 (11.1%) | 23 (11.5%) | 22 (10.9%) | 66 (11.2%) | 26 (13%) |
| 1 | 12 (8.3%) | 25 (12.5%) | 16 (8%) | 53 (9%) | 28 (14%) |
| 2 | 19 (13.2%) | 13 (6.5%) | 21 (10.4%) | 73 (12.4%) | 29 (14.5%) |
| 3 | 45 (31.3%) | 20 (10%) | 30 (14.9%) | 198 (33.5%) | 32 (16%) |
* means p < 0.05, ** means p < 0.0001.
Figure 1. The feature selection results of (A) cystitis vs. kidney cancer, (B) cystitis vs. prostate cancer, (C) cystitis vs. bladder cancer, (D) cystitis vs. uterus cancer, (E) kidney cancer vs. bladder cancer, (F) kidney cancer vs. prostate cancer, (G) kidney cancer vs. uterus cancer, (H) bladder cancer vs. prostate cancer, and (I) bladder cancer vs. uterus cancer.
The top selected feature from each comparison group.
| Comparison Group | | Top Selected Feature |
|---|---|---|
| cystitis | kidney cancer | Calcium |
| | prostate cancer | ALP |
| | bladder cancer | Albumin |
| | uterus cancer | Urine Ketone |
| kidney cancer | bladder cancer | Urine occult blood |
| | prostate cancer | ALP |
| | uterus cancer | Calcium |
| bladder cancer | prostate cancer | ALP |
| | uterus cancer | Creatinine |
Figure 2. The specificity comparison of sampling techniques in five different models.
Figure 3. The block diagram of the forward selection method.
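Forward selection, in its generic form, starts from an empty feature set and repeatedly adds the candidate feature that most improves a scoring function, stopping when no candidate helps. The sketch below shows that greedy loop. The stopping rule, the function name `forward_selection`, and the toy scoring function are assumptions for illustration; the scoring function stands in for whatever model performance measure is used in practice and does not reproduce the procedure in the block diagram.

```python
from typing import Callable, List, Sequence


def forward_selection(
    candidates: Sequence[str],
    score: Callable[[List[str]], float],
) -> List[str]:
    """Greedily add the feature that most improves `score`; stop when none does."""
    selected: List[str] = []
    remaining = list(candidates)
    best_score = float("-inf")
    while remaining:
        # Score every one-feature extension of the current selection.
        trials = [(score(selected + [feature]), feature) for feature in remaining]
        top_score, top_feature = max(trials)
        if top_score <= best_score:
            break  # no candidate improves the score
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected


# Hypothetical per-feature gains and a size penalty, for illustration only.
weights = {"calcium": 0.30, "ALP": 0.20, "albumin": 0.10, "pH": 0.00}


def toy_score(features: List[str]) -> float:
    return sum(weights[f] for f in features) - 0.05 * len(features)


selected = forward_selection(list(weights), toy_score)
```

With this toy score the loop adds `calcium`, `ALP`, and `albumin` in that order, then stops because adding `pH` lowers the score.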
Predictive performance of five models in discriminating patients with bladder cancer from patients with cystitis.
| Models | Accuracy (95% CI) | Precision (95% CI) | F1 Score (95% CI) | Sensitivity (95% CI) | Specificity (95% CI) | AUC (95% CI) |
|---|---|---|---|---|---|---|
| decision tree | 76.2% (71–81.5%) | 77.9% (69.1–86.6%) | 74.6% (69.8–79.5%) | 73.2% (62.5–83.9%) | 78.1% (64.9–91.3%) | 0.775 (0.711–0.839) |
| random forest | 83.1% (78.7–87.5%) | 78.2% (71.5–85%) | 81.6% (74.3–88.9%) | 85.5% (76–94.9%) | 79.4% (72–86.8%) | 0.887 (0.826–0.947) |
| SVM | 71.7% (63.5–80%) | 81.9% (71.3–92.4%) | 65.5% (51.9–79.1%) | 55.7% (40.6–70.8%) | 86.7% (78.9–94.5%) | 0.736 (0.624–0.849) |
| XGBoost | 82.8% (76.7–88.8%) | 84.7% (74.5–94.9%) | 82.7% (76.2–89.2%) | 81.4% (75.1–87.7%) | 83.3% (71–95.7%) | 0.879 (0.819–0.939) |
| lightGBM | 87.6% (81–94.1%) | 86.3% (77.9–94.6%) | 87.7% (81.8–93.5%) | 89.5% (84.2–94.9%) | 85.5% (75.2–95.7%) | 0.932 (0.862–1.000) |
Predictive performance of the lightGBM model.
| Groups | Accuracy (95% CI) | Precision (95% CI) | F1 Score (95% CI) | Sensitivity (95% CI) | Specificity (95% CI) | AUC (95% CI) |
|---|---|---|---|---|---|---|
| bladder cancer | 86.9% (77.3–96.5%) | 87.1% (76.5–97.7%) | 87.3% (77.5–97%) | 87.8% (77.3–98.3%) | 86.7% (76.8–96.6%) | 0.918 (0.849–0.988) |
| bladder cancer | 84.8% (76–93.6%) | 86.6% (76.8–96.4%) | 85.1% (75.3–94.9%) | 84.4% (71.9–96.9%) | 85.1% (76.4–93.8%) | 0.883 (0.823–0.942) |
| bladder cancer | 87.6% (81–94.1%) | 86.3% (77.9–94.6%) | 87.7% (81.8–93.5%) | 89.5% (84.2–94.9%) | 85.5% (75.2–95.7%) | 0.932 (0.862–1.000) |
| bladder cancer | 84.5% (78.3–90.6%) | 83% (73.2–92.9%) | 84.5% (78.3–90.7%) | 86.8% (79.9–93.7%) | 82.9% (72.7–93.2%) | 0.928 (0.88–0.977) |
| kidney cancer | 86.2% (78.2–94.2%) | 86.8% (78–95.6%) | 86.9% (79–94.9%) | 88% (76.5–99.6%) | 84.2% (73–95.5%) | 0.903 (0.854–0.952) |
| uterus cancer | 83.8% (77.1–90.5%) | 87% (76.8–97.3%) | 83.7% (76.7–90.7%) | 81.8% (71.2–92.5%) | 86.5% (76.9–96.1%) | 0.915 (0.834–0.995) |
| prostate cancer | 87.6% (82.9–92.2%) | 88.5% (80.3–96.6%) | 87% (81.7–92.4%) | 86.7% (77.5–95.8%) | 88.4% (77.9–98.9%) | 0.944 (0.903–0.986) |
| prostate cancer | 84.1% (77–91.3%) | 82.5% (68.2–96.7%) | 82.3% (70.1–94.5%) | 82.7% (70.5–94.9%) | 84.7% (76.4–93.1%) | 0.897 (0.837–0.957) |
| kidney cancer | 86.2% (78.4–94%) | 87.1% (76.5–97.7%) | 86.8% (79–94.5%) | 87.1% (79.5–94.6%) | 86.2% (75.6–96.9%) | 0.923 (0.858–0.988) |
| kidney cancer | 84.5% (77.7–91.2%) | 82.1% (72.7–91.6%) | 84.9% (79.1–90.7%) | 88.6% (82.5–94.6%) | 81.3% (69.1–93.5%) | 0.91 (0.849–0.972) |
Figure 4. The receiver operating characteristic curve (ROC) plot of separating bladder cancer from cystitis or other cancers.
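The AUC values reported with these ROC curves can be read as the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative case, with ties counted as one half. The sketch below computes the AUC directly from that pairwise definition. The function name `roc_auc` and the example labels and scores are hypothetical, and this is not necessarily how the study computed its AUC values or confidence intervals.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = bladder cancer, 0 = comparison group) and model scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)
```

Here eight of the nine positive-negative pairs are ordered correctly, so the AUC is 8/9.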