Wen-Chien Ting, Yen-Chiao Angel Lu, Wei-Chi Ho, Chalong Cheewakriangkrai, Horng-Rong Chang, Chia-Ling Lin.
Abstract
BACKGROUND: Colorectal cancer (CRC) is the third most commonly diagnosed cancer worldwide. Recurrence of CRC (Re) and onset of a second primary malignancy (SPM) are important indicators in treating CRC, but it is often difficult to predict the onset of an SPM. Therefore, we used machine learning to identify risk factors that affect Re and SPM. PATIENTS AND METHODS: CRC patients were identified from the cancer registry databases of three medical centers. All patients were classified based on Re or no recurrence (NRe) as well as SPM or no SPM (NSPM). Two classifiers, namely A Library for Support Vector Machines (LIBSVM) and Reduced Error Pruning Tree (REPTree), were applied to analyze the relationship between clinical features and the Re and/or SPM category by constructing optimized models.
Keywords: colorectal cancer; machine learning; second primary malignancy
Year: 2020 PMID: 32132862 PMCID: PMC7053359 DOI: 10.7150/ijms.37134
Source DB: PubMed Journal: Int J Med Sci ISSN: 1449-1907 Impact factor: 3.738
Figure 1. Second primary malignancies and recurrence of colorectal cancer (CRC) in the study population.
Figure 2. Workflow for model construction from clinical data.
Model evaluation for CRC recurrence alone.
| Classifier | TP | FP | TN | FN | Sn | Sp | Acc | MCC |
|---|---|---|---|---|---|---|---|---|
| LIBSVM | 2040 | 351 | 1638 | 270 | 0.883 | 0.824 | 0.856 | 0.709 |
| LIBSVM_4F | 2063 | 281 | 1708 | 247 | 0.893 | 0.859 | 0.877 | 0.753 |
| LIBSVM_3F | 2059 | 273 | 1716 | 251 | 0.891 | 0.863 | 0.878 | 0.755 |
| REPTree | 2035 | 263 | 1726 | 275 | 0.881 | 0.868 | 0.875 | 0.748 |
| REPTree_3F | 2070 | 286 | 1703 | 240 | 0.896 | 0.856 | 0.878 | 0.754 |
LIBSVM_3F and REPTree_3F models were constructed using feature selection with the top three features; the LIBSVM_4F model used the top four.
Figure 3. Decision tree of important factors for recurrent CRC classification using the REPTree_3F model.
Model evaluation for SPM alone.
| Ratio | Classifier | TP | FP | TN | FN | Sn | Sp | Acc | MCC |
|---|---|---|---|---|---|---|---|---|---|
| 1:6.95 | LIBSVM | 315 | 1345 | 2413 | 226 | 0.582 | 0.642 | 0.635 | 0.153 |
| 1:1.5 | LIBSVM | 173 | 89 | 723 | 368 | 0.320 | 0.890 | 0.662 | 0.261 |
| 1:1 | LIBSVM | 350 | 175 | 366 | 191 | 0.647 | 0.677 | 0.662 | 0.324 |
| 1:1 | LIBSVM_F | 363 | 188 | 353 | 178 | 0.671 | 0.652 | 0.662 | 0.324 |
| 1:1 | REPTree | 373 | 221 | 320 | 168 | 0.689 | 0.591 | 0.640 | 0.282 |
| 1:1 | REPTree_OP | 382 | 224 | 317 | 159 | 0.706 | 0.586 | 0.646 | 0.294 |
| 1:1 | REPTree_F | 363 | 240 | 301 | 178 | 0.671 | 0.556 | 0.614 | 0.229 |
LIBSVM_F and REPTree_F models were constructed using feature selection with the top eight features. The REPTree_OP model was constructed using parameter optimization with the top three features.
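The Ratio column gives the SPM-to-NSPM proportion in each data set. In every row TP + FN is 541, while FP + TN falls from 3758 to 812 to 541, which is consistent with down-sampling the NSPM majority class. The source does not state how the subsets were drawn, so the following is only a minimal sketch of random undersampling; the names `undersample`, `ratio`, and `seed` and the toy records are hypothetical.

```python
import random

def undersample(positives, negatives, ratio=1.0, seed=0):
    """Keep all positives and a random subset of negatives.

    ratio is the desired number of negatives per positive (1.0 gives 1:1).
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    n_keep = min(len(negatives), round(len(positives) * ratio))
    return positives + rng.sample(negatives, n_keep)

# Toy stand-ins sized like the SPM data: 541 SPM and 3758 NSPM records.
spm = [("spm", i) for i in range(541)]
nspm = [("nspm", i) for i in range(3758)]

balanced = undersample(spm, nspm, ratio=1.0)
print(len(balanced))
```

Balancing matters here because accuracy alone can hide poor minority-class performance: at 1:6.95 the LIBSVM model reaches an accuracy of 0.635 but an MCC of only 0.153.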
Figure 4. Decision tree of important features for SPM classification using the REPTree_OP model.
Model evaluation for the combined SPM and recurrence categories.
| Category | Classifier | TP | FP | TN | FN | Sn | Sp | Acc | MCC |
|---|---|---|---|---|---|---|---|---|---|
| SPM+Re | REPTree | 161 | 69 | 141 | 47 | 0.774 | 0.671 | 0.722 | 0.448 |
| SPM+Re | REPTree_F | 169 | 83 | 127 | 39 | 0.813 | 0.605 | 0.708 | 0.426 |
| SPM+Re | LIBSVM | 145 | 63 | 148 | 62 | 0.700 | 0.701 | 0.701 | 0.402 |
| SPM+Re | LIBSVM_F | 161 | 65 | 145 | 47 | 0.774 | 0.690 | 0.732 | 0.466 |
| NSPM+Re | REPTree | 1546 | 365 | 1416 | 235 | 0.868 | 0.795 | 0.832 | 0.665 |
| NSPM+Re | REPTree_F | 1572 | 374 | 1407 | 209 | 0.883 | 0.790 | 0.836 | 0.676 |
| NSPM+Re | LIBSVM | 1477 | 365 | 1416 | 304 | 0.829 | 0.795 | 0.812 | 0.625 |
| NSPM+Re | LIBSVM_F | 1569 | 370 | 1411 | 212 | 0.881 | 0.792 | 0.837 | 0.676 |
| SPM+NRe | REPTree | 252 | 104 | 229 | 81 | 0.757 | 0.688 | 0.722 | 0.446 |
| SPM+NRe | REPTree_OP | 274 | 108 | 225 | 59 | 0.823 | 0.676 | 0.749 | 0.504 |
| SPM+NRe | REPTree_F | 236 | 98 | 235 | 97 | 0.709 | 0.706 | 0.707 | 0.414 |
| SPM+NRe | LIBSVM | 235 | 99 | 234 | 98 | 0.706 | 0.703 | 0.704 | 0.408 |
| SPM+NRe | LIBSVM_F | 226 | 63 | 270 | 107 | 0.679 | 0.811 | 0.745 | 0.494 |
| NSPM+NRe | REPTree | 1705 | 504 | 1473 | 272 | 0.862 | 0.745 | 0.804 | 0.612 |
| NSPM+NRe | REPTree_OP | 1739 | 505 | 1472 | 238 | 0.880 | 0.745 | 0.812 | 0.630 |
| NSPM+NRe | REPTree_F | 1670 | 465 | 1512 | 307 | 0.845 | 0.765 | 0.805 | 0.611 |
| NSPM+NRe | LIBSVM | 1700 | 539 | 1438 | 277 | 0.860 | 0.727 | 0.794 | 0.592 |
| NSPM+NRe | LIBSVM_2F | 1705 | 502 | 1475 | 272 | 0.862 | 0.746 | 0.804 | 0.613 |
LIBSVM_F indicates an SVM model built with feature selection. REPTree_F indicates a REPTree model built with feature selection. REPTree_OP indicates a REPTree model built with parameter optimization.
Figure 5. Decision tree of important factors for SPM+Re classification using the REPTree_F model.
Figure 6. Decision tree of important factors for NSPM+Re classification using the REPTree_F model.
Figure 7. Decision tree of important factors for SPM+NRe classification using the REPTree_OP model.
Figure 8. Decision tree of important factors for NSPM+NRe classification using the REPTree_OP model.
Order of top ten features by F-score for feature selection
| Re | SPM | SPM+Re | NSPM+Re | SPM+NRe | NSPM+NRe |
|---|---|---|---|---|---|
| pStage | behavior code | surgical edge | pStage | behavior code | pStage |
| surgical edge | differentiation | pStage | surgical edge | pStage | surgical edge |
| smoking | regional body order | areca | behavior code | surgical edge | differentiation |
| drink | age | drink | smoking | highest dose | tumor size |
| radiation therapy | areca | smoking | radiation therapy | radiation therapy | smoking |
| areca | surgery | BMI | drink | age | drink |
| differentiation | radiation therapy | age | surgery | lower number of times | radiation therapy |
| surgery | lowest dose | differentiation | areca | smoking | areca |
| BMI | organizational patterns | lowest dose | BMI | radiation therapy before surgery | BMI |
| behavior code | highest dose | tumor size | differentiation | tumor size | surgery |
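The columns above rank features by F-score, but the formula is not given here. Assuming it is the discriminative F-score often used for feature selection alongside LIBSVM (Chen and Lin), the score of one feature is the squared distance of each class mean from the overall mean, summed over the two classes, divided by the sum of the two within-class sample variances. A minimal sketch on made-up values follows; the function, feature names, and numbers are all hypothetical.

```python
from statistics import mean, variance

def f_score(pos, neg):
    """Discriminative F-score of a single numeric feature.

    pos and neg hold the feature's values for positive and negative cases.
    """
    overall = mean(pos + neg)
    between = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
    within = variance(pos) + variance(neg)  # sample variances (n - 1)
    return between / within if within else float("inf")

# Made-up toy data: (positive-class values, negative-class values).
features = {
    "feature_a": ([3, 4, 4, 3], [1, 2, 1, 2]),  # classes well separated
    "feature_b": ([1, 2, 3, 4], [2, 3, 1, 4]),  # classes overlap fully
}

# Rank features from most to least discriminative.
ranked = sorted(features, key=lambda k: f_score(*features[k]), reverse=True)
print(ranked)
```

A larger score means the feature separates the two classes better on its own; the score ignores interactions between features, so a top-ranked list like the one above is a filter step rather than a guarantee of the best subset.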