Sandeep Chandana, Henry Leung, Kiril Trpkov.
Abstract
This study proposes a novel technique for automatically selecting the best pairings of feature subsets and sampling techniques to predict the stage of prostate cancer. It also addresses class imbalance, a problem that is prominent in most medical data sets. Three feature subsets were obtained using principal components analysis (PCA), a genetic algorithm (GA), and a rough sets (RS) based approach. Under-sampling, the synthetic minority over-sampling technique (SMOTE), and a combination of the two were investigated, and the performance of the resulting models was compared. The classifier outputs were combined using Dempster-Shafer (DS) theory, and the choice of which models to combine was made by a GA. We found that the best overall performance came from a support vector machine (SVM) trained on under-sampled data with rough-set-based features.
Keywords: classifier fusion; prostate cancer; staging
Year: 2009 PMID: 19352459 PMCID: PMC2664701 DOI: 10.4137/cin.s819
Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351
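The abstract names Dempster-Shafer theory as the way classifier outputs are fused, but it does not say how each output is turned into a mass function. The following is a minimal sketch of Dempster's rule of combination for two classifiers on a two-class frame. The helper `mass_from_score`, the reliability weights, and the example scores are all hypothetical and are not taken from the study.

```python
def mass_from_score(p_pos, reliability):
    """Hypothetical mapping from a classifier posterior to a mass function.

    The posterior is discounted by `reliability`; the remaining mass goes to
    "theta", the whole frame {pos, neg}, i.e. "undecided".
    """
    return {
        "pos": reliability * p_pos,
        "neg": reliability * (1.0 - p_pos),
        "theta": 1.0 - reliability,
    }


def dempster_combine(m1, m2):
    """Combine two mass functions over {pos, neg} with Dempster's rule."""
    # Mass assigned to contradictory pairs (one says pos, the other neg).
    conflict = m1["pos"] * m2["neg"] + m1["neg"] * m2["pos"]
    if conflict >= 1.0:
        raise ValueError("total conflict: Dempster's rule is undefined")
    norm = 1.0 - conflict
    return {
        "pos": (m1["pos"] * m2["pos"] + m1["pos"] * m2["theta"]
                + m1["theta"] * m2["pos"]) / norm,
        "neg": (m1["neg"] * m2["neg"] + m1["neg"] * m2["theta"]
                + m1["theta"] * m2["neg"]) / norm,
        "theta": m1["theta"] * m2["theta"] / norm,
    }


# Illustrative scores only: two classifiers that both lean towards "pos".
svm_mass = mass_from_score(p_pos=0.7, reliability=0.8)
knn_mass = mass_from_score(p_pos=0.6, reliability=0.5)
fused = dempster_combine(svm_mass, knn_mass)
print(fused)
```

Because the rule is associative, more than two models can be fused by applying `dempster_combine` repeatedly, which is one way the multi-model combinations reported in the tables could be formed.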
Patient characteristics.
| Clinical parameters | Patients with organ-confined PCa (n = 934) | Patients with extraprostatic extension (n = 120) | All patients (n = 1054) |
|---|---|---|---|
| Age (years): median (range) | 59.9 (36.2–77.4) | 63.4 (42.7–74.2) | 60.4 (36.2–77.4) |
| Age categories (years): | | | |
| ≤50 (%) | 103 (11) | 5 (4.2) | 108 (10.2) |
| >50–60 (%) | 368 (39.4) | 34 (28.3) | 402 (38.1) |
| >60–70 (%) | 404 (43.3) | 67 (55.8) | 471 (44.7) |
| >70–80 (%) | 59 (6.3) | 14 (11.7) | 73 (6.9) |
| PSA (ng/ml): median (range) | 5.7 (0.29–55) | 6.9 (1.8–80) | 5.8 (0.29–80) |
| PSA categories (ng/ml): | | | |
| ≤4 (%) | 169 (18.1) | 12 (10.0) | 181 (17.2) |
| >4–10 (%) | 662 (70.9) | 73 (60.8) | 735 (69.7) |
| >10–20 (%) | 98 (10.5) | 27 (22.5) | 125 (11.9) |
| >20–50 (%) | 3 (0.3) | 6 (5.0) | 9 (0.9) |
| >50–100 (%) | 2 (0.2) | 2 (1.7) | 4 (0.4) |
| Prostate Gland Volume (cc): median (range) | 35.6 (7–193.2) | 32.1 (10.02–176.6) | 35.15 (7–193.2) |
| Prostate Gland Volume categories (cc): | | | |
| ≤25 (%) | 188 (20.1) | 33 (27.5) | 221 (21.0) |
| >25–50 (%) | 503 (53.9) | 61 (50.8) | 564 (53.5) |
| >50–100 (%) | 243 (26.0) | 26 (21.7) | 269 (25.5) |
| PSA Density: median (range) | 0.15 (0.01–2.1) | 0.22 (0.03–2.5) | 0.16 (0.01–2.5) |
| PSA Density categories: | | | |
| ≤0.10 (%) | 231 (24.7) | 18 (15.0) | 249 (23.6) |
| >0.10–0.25 (%) | 535 (57.3) | 49 (40.8) | 584 (55.4) |
| >0.25–0.50 (%) | 139 (14.9) | 38 (31.7) | 177 (16.8) |
| >0.50–1.0 (%) | 25 (2.7) | 11 (9.2) | 36 (3.4) |
| >1 (%) | 4 (0.4) | 4 (3.3) | 8 (0.8) |
| PCa Length (mm): median (range) | 5.25 (0–117) | 19 (0.15–102.75) | 6 (0–117) |
| PCa Length categories (mm): | | | |
| ≤10 (%) | 615 (65.8) | 41 (34.2) | 656 (62.2) |
| >10–20 (%) | 174 (18.6) | 25 (20.8) | 199 (18.9) |
| >20–40 (%) | 107 (11.5) | 37 (30.8) | 144 (13.7) |
| >40–60 (%) | 31 (3.3) | 12 (10.0) | 43 (4.1) |
| >60 (%) | 7 (0.7) | 5 (4.2) | 12 (1.1) |
| No. of cancer-positive cores: median (range) | 2 (1–10) | 4 (1–10) | 2 (1–10) |
| No. of cancer-positive cores categories: | | | |
| ≤4 (%) | 767 (82.1) | 74 (61.7) | 841 (79.8) |
| >4–6 (%) | 109 (11.7) | 27 (22.5) | 136 (12.9) |
| >6 (%) | 58 (6.2) | 19 (15.8) | 77 (7.3) |
| Percent total core involvement on biopsy: median (range) | 3.5 (0–78) | 12.5 (0.1–68.5) | 4 (0–78) |
| Percent core involvement categories: | | | |
| ≤3 (%) | 441 (47.2) | 20 (16.7) | 461 (43.7) |
| >3–10 (%) | 280 (30.0) | 30 (25.0) | 310 (29.4) |
| >10–20 (%) | 140 (15.0) | 42 (35.0) | 182 (17.3) |
| >20 (%) | 73 (7.8) | 28 (23.3) | 101 (9.6) |
| Biopsy Gleason Score: median (range) | 6 (4–9) | 7 (6–9) | 6 (4–9) |
| Biopsy Gleason Score categories: | | | |
| ≤6 (%) | 683 (73.1) | 39 (32.5) | 722 (68.5) |
| 7–8 (%) | 246 (26.3) | 77 (64.2) | 323 (30.6) |
| >8 (%) | 5 (0.5) | 4 (3.3) | 9 (0.9) |
Abbreviations: PSA, prostate specific antigen; PCa, prostate cancer.
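The class sizes above (934 organ-confined versus 120 extraprostatic cases) show the imbalance that the sampling techniques are meant to correct. The sketch below illustrates random under-sampling and a simplified SMOTE-style interpolation. The two-feature toy rows, the neighbour count `k=5`, and the 1:1 target ratio are assumptions made for illustration; the study's actual features and sampling ratios are not stated here.

```python
import random


def undersample(majority, minority, rng):
    """Randomly keep as many majority rows as there are minority rows."""
    return rng.sample(majority, len(minority)), list(minority)


def smote(minority, n_new, k, rng):
    """Create n_new synthetic minority rows (simplified SMOTE).

    Each synthetic row lies on the segment between a randomly chosen minority
    row and one of its k nearest minority neighbours (Euclidean distance).
    """
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, neighbour)))
    return synthetic


rng = random.Random(0)
# Toy rows: only the class sizes come from the table, the features are made up.
organ_confined = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(934)]
extraprostatic = [(rng.gauss(1, 1), rng.gauss(1, 1)) for _ in range(120)]

under_major, under_minor = undersample(organ_confined, extraprostatic, rng)
synthetic_minor = smote(extraprostatic, n_new=120, k=5, rng=rng)
print(len(under_major), len(under_minor), len(synthetic_minor))
```

A combined scheme could chain the two steps, for example by under-sampling the majority class part of the way and generating synthetic minority rows to close the remaining gap.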
ROC confusion matrix.
| | | Predicted class | |
|---|---|---|---|
| | | Positive | Negative |
| Actual class | Positive | TP | FN |
| | Negative | FP | TN |
Abbreviations: TP, true positive; FN, false negative; FP, false positive; TN, true negative.
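The four cells of the confusion matrix determine the quantities reported in the tables that follow. As a minimal sketch with made-up counts and scores (none of them are results from the study), sensitivity, specificity, accuracy, and a single ROC point can be computed as shown below. AUC is computed here by its rank interpretation: the probability that a randomly chosen positive case scores higher than a randomly chosen negative one.

```python
def rates(tp, fn, fp, tn):
    """Summary rates for one decision threshold."""
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        # One point on the ROC curve: (false positive rate, true positive rate).
        "roc_point": (1.0 - specificity, sensitivity),
    }


def auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


# Illustrative numbers only.
summary = rates(tp=8, fn=2, fp=1, tn=9)
area = auc(pos_scores=[0.9, 0.8, 0.4], neg_scores=[0.7, 0.3, 0.2])
print(summary, area)
```

Sweeping the decision threshold and collecting each `roc_point` traces the ROC curves shown in the figures; the pairwise form of `auc` avoids choosing thresholds at all.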
Figure 1. SVM performance for different training data sizes.
Figure 3. ROC curves of SVM and Rough Set Features.
Figure 14. ROC curves of KNN and combined sampling.
AUC values with SVM for PCa.
| | Under | SMOTE | Under + SMOTE | DS |
|---|---|---|---|---|
| RST | 0.8409 | 0.7223 | 0.8326 | 0.8611 |
| PCA | 0.8075 | 0.7439 | 0.8334 | 0.8392 |
| GA | 0.7704 | 0.7425 | 0.7112 | 0.7597 |
| DS | 0.8313 | 0.7461 | 0.8420 | |
AUC values with KNN for PCa.
| | Under | SMOTE | Under + SMOTE | DS |
|---|---|---|---|---|
| RST | 0.7295 | 0.8088 | 0.7764 | 0.8065 |
| PCA | 0.6543 | 0.7450 | 0.7787 | 0.7891 |
| GA | 0.7383 | 0.7454 | 0.7560 | 0.7926 |
| DS | 0.7484 | 0.8001 | 0.7798 | |
Performance of GA optimized fusion for PCa.
| | 2-models | 3-models | 4-models | 5-models |
|---|---|---|---|---|
| Models | SVM-R-U | SVM-R-U | SVM-R-U | SVM-R-U |
| | SVM-R-US | SVM-P-US | SVM-P-U | KNN-R-S |
| | | SVM-R-US | KNN-R-S | SVM-P-US |
| | | | KNN-G-U | SVM-R-US |
| | | | | KNN-G-U |
| AUC | 0.8617 | 0.8626 | 0.8640 | 0.8631 |
| Accuracy | 89.4% | 89.7% | 90.1% | 89.8% |
| AUC | 0.8640 | 0.8049 | 0.8359 | 0.8580 |
| Accuracy | 90.1% | 83.0% | 86.0% | 88.5% |
AUC values with SVM for SimPCa.
| | Under | SMOTE | Under + SMOTE | DS |
|---|---|---|---|---|
| RST | 0.8376 | 0.7340 | 0.8312 | 0.8535 |
| PCA | 0.8076 | 0.7195 | 0.8304 | 0.8310 |
| GA | 0.7763 | 0.7707 | 0.7738 | 0.7790 |
| DS | 0.8403 | 0.7716 | 0.8336 | |
Performance of GA optimized fusion for SimPCa.
| | 2-models | 3-models | 4-models | 5-models |
|---|---|---|---|---|
| Models | SVM-R-U | SVM-R-U | SVM-R-U | SVM-R-U |
| | SVM-R-US | SVM-P-US | SVM-P-U | KNN-R-S |
| | | SVM-R-US | KNN-R-S | SVM-P-US |
| | | | KNN-G-U | SVM-R-US |
| | | | | KNN-G-U |
| AUC | 0.8596 | 0.8620 | 0.8632 | 0.8610 |
| Accuracy | 88.8% | 89.4% | 89.8% | 89.3% |
AUC values with SVM as the classifier for BCa.
| | Under | SMOTE | Under + SMOTE | DS |
|---|---|---|---|---|
| RST | 0.8917 | 0.9301 | 0.9680 | 0.9691 |
| PCA | 0.9342 | 0.9385 | 0.9360 | 0.9429 |
| GA | 0.9965 | 0.9737 | 0.9920 | 0.9965 |
| DS | 0.9965 | 0.9753 | 0.9921 | |
AUC values with KNN as the classifier for BCa.
| | Under | SMOTE | Under + SMOTE | DS |
|---|---|---|---|---|
| RST | 0.9202 | 0.9243 | 0.9236 | 0.9270 |
| PCA | 0.9233 | 0.9230 | 0.9240 | 0.9240 |
| GA | 0.9289 | 0.9276 | 0.9318 | 0.9333 |
| DS | 0.9296 | 0.9290 | 0.9330 | |
AUC values with SVM as the classifier for LCa.
| | Under | SMOTE | Under + SMOTE | DS |
|---|---|---|---|---|
| RST | 0.9265 | 0.9310 | 0.9305 | 0.9356 |
| PCA | 0.9000 | 0.9184 | 0.9376 | 0.9380 |
| GA | 0.9366 | 0.9298 | 0.9275 | 0.9349 |
| DS | 0.9366 | 0.9327 | 0.9401 | |
AUC values with KNN as the classifier for LCa.
| | Under | SMOTE | Under + SMOTE | DS |
|---|---|---|---|---|
| RST | 0.9297 | 0.9140 | 0.9139 | 0.9321 |
| PCA | 0.9308 | 0.9384 | 0.9381 | 0.9396 |
| GA | 0.9350 | 0.9367 | 0.9330 | 0.9371 |
| DS | 0.9385 | 0.9390 | 0.9387 | |
Performance of the GA optimized fusion for BCa.
| | 2-models | 3-models | 4-models | 5-models |
|---|---|---|---|---|
| Models | SVM-G-U | SVM-G-U | SVM-G-U | SVM-G-U |
| | SVM-G-US | SVM-G-US | SVM-G-US | SVM-G-US |
| | | KNN-G-US | KNN-R-US | KNN-R-US |
| | | | KNN-G-US | KNN-G-US |
| | | | | SVM-G-S |
| AUC | 0.9965 | 0.9994 | 0.9998 | 0.9998 |
| Accuracy | 99.1% | 99.4% | 99.4% | 99.4% |
Performance of the GA optimized fusion for LCa.
| | 2-models | 3-models | 4-models | 5-models |
|---|---|---|---|---|
| Models | KNN-P-S | KNN-P-S | KNN-P-S | KNN-P-S |
| | SVM-P-US | SVM-P-US | SVM-P-US | SVM-P-US |
| | | KNN-P-US | KNN-P-US | KNN-P-US |
| | | | KNN-G-S | KNN-G-S |
| | | | | SVM-G-U |
| AUC | 0.9410 | 0.9427 | 0.9451 | 0.9451 |
| Accuracy | 96.6% | 97% | 97.5% | 97.5% |
Classification accuracy for BCa.
| Work (Year) | Method | Classification accuracy % |
|---|---|---|
| Quinlan | C4.5 | 94.74 |
| Pena-Reyes and Sipper | Fuzzy-GA | 97.36 |
| Goodman et al. | LVQ | 96.80 |
| Ubeyli | EM-NN | 98.85 |
| Polat and Gunes | LS-SVM | 98.53 |
| Akay | SVM | 99.02 |
| Proposed method | GA-Fusion | 99.40 |
Abbreviations: LVQ, learning vector quantization; EM, expectation maximization; LS, least squares.
Classification Accuracy for LCa.
| Work (Year) | Method | Classification accuracy % |
|---|---|---|
| Aeberhard et al. | RDA | 62.5 |
| Luukka | Yu-ANFIS | 65.48 |
| Proposed method | GA-Fusion | 97.5 |
Abbreviations: RDA, regularized discriminant analysis; Yu-ANFIS, Yu norm ANFIS.
AUC values with KNN for SimPCa.
| | Under | SMOTE | Under + SMOTE | DS |
|---|---|---|---|---|
| RST | 0.7290 | 0.8166 | 0.7705 | 0.8166 |
| PCA | 0.7130 | 0.7576 | 0.7701 | 0.7710 |
| GA | 0.8334 | 0.7358 | 0.7660 | 0.8346 |
| DS | 0.8360 | 0.8202 | 0.7740 | |