| Literature DB >> 22848691 |
Peng Gao, Xin Zhou, Zhen-ning Wang, Yong-xi Song, Lin-lin Tong, Ying-ying Xu, Zhen-yu Yue, Hui-mian Xu.
Abstract
OBJECTIVE: Over the past decades, many studies have used data mining technology to predict the 5-year survival rate of colorectal cancer, but there have been few reports that compared multiple data mining algorithms to the TNM classification of malignant tumors (TNM) staging system using a dataset in which the training and testing data were from different sources. Here we compared nine data mining algorithms to the TNM staging system for colorectal survival analysis.
Year: 2012 PMID: 22848691 PMCID: PMC3404978 DOI: 10.1371/journal.pone.0042015
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Variables Available for Analysis.
| Variable | Type | Explanation | Supported |
| Inputs | | | |
| Age at diagnosis | Numeric | Years | Bothd |
| Race/ethnicity | Categorical | 22 races | Both |
| Sex | Binary | Female/male | Both |
| Primary Site | Categorical | Eleven sites | Both |
| AJCCa stage 7th | Categorical | Pathologic code of TNM | Both |
| Grade | Categorical | Tumor grade (Grades 1–4) | Both |
| EODb 10 - size | Numeric | Size of primary tumor | Both |
| EOD 10 - extent | Categorical | Invasive extension of primary tumor | Both |
| EOD 10 - nodes | Categorical | Extension of lymph node involvement | SEERe |
| Regional nodes examined | Numeric | No. of regional lymph nodes examined | Both |
| Regional nodes positive | Numeric | No. of positive regional lymph nodes | Both |
| SEER historic stage A | Categorical | A stage system coded by SEER | SEER |
| SEER summary stage 1977 | Categorical | A stage system coded by SEER | SEER |
| Histologic Type ICD-O-3c | Categorical | International Classification of Diseases for Oncology, Third Edition | SEER |
| Number of primaries | Numeric | Number of primaries | Both |
| First malignant primary indicator | Binary | Yes/no | SEER |
| Radiation sequence with surgery | Categorical | Prior to/after surgery/both | SEER |
| Surgery of primary site | Categorical | Extension of surgery | Both |
| Surgery of oth reg/dis sites | Categorical | Surgery of other regional site(s), distant site(s), or distant lymph node(s) | Both |
| Outcome | |||
| SEER cause-specific death classification | Binary | Yes/no | Both |
AJCCa: American Joint Committee on Cancer.
EODb: SEER extent of disease.
ICD-O-3c: International Classification of Diseases for Oncology, Third Edition.
Bothd: Supported by both SEER dataset and CMU-SO dataset.
SEERe: National Cancer Institute's Surveillance, Epidemiology, and End Results.
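The inputs above mix numeric and categorical fields, so building a model-ready design matrix requires scaling the numeric variables and one-hot encoding the categorical ones. A minimal sketch with scikit-learn, using illustrative column names rather than the actual SEER field names:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny synthetic stand-in for the variables above
# (column names are illustrative, not the SEER field names).
df = pd.DataFrame({
    "age_at_diagnosis": [55, 67, 72, 48],
    "tumor_size_mm": [20, 35, 50, 12],
    "sex": ["F", "M", "M", "F"],
    "grade": ["G2", "G3", "G1", "G2"],
})

pre = ColumnTransformer([
    ("num", StandardScaler(), ["age_at_diagnosis", "tumor_size_mm"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex", "grade"]),
])

# 2 scaled numeric columns + 2 sex categories + 3 grade categories = 7 columns
X = pre.fit_transform(df)
print(X.shape)
```

`handle_unknown="ignore"` matters when, as here, the training and testing data come from different sources and a test record may carry a category never seen in training.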
Nine algorithms used in the construction of prediction models.
| Data mining algorithm | Features | Details | Optimized parameters | Reference |
| BP | the most frequently used kind of artificial neural network | A single hidden layer was used, with Levenberg–Marquardt backpropagation as the training algorithm; 1,000 cases drawn from the training cases were held out for validation to avoid “overfitting” | the number of neurons in the hidden layer | Burke |
| CART | easy to understand, with an efficient training algorithm | 1,000 cases split from the training cases were used for pruning | | Valera |
| SVM | more robust; overfitting is unlikely to occur | The kernel function used in this study was the RBF | cost (c), which controls overfitting of the model, and gamma (g), which controls the degree of nonlinearity of the model | Tanabe |
| ANFIS | a fuzzy inference system in the framework of a multilayer feed-forward network | subtractive clustering and a hybrid learning algorithm were used to generate the fuzzy inference system and estimate the membership-function parameters | | Schwarzer |
| RBF | an alternative to the BP network, with a radial basis layer | | | Venkatesan |
| GRNN | similar to RBF, with a special linear layer after the radial basis layer | | the width of the RBF, denoted as the spread | Lai |
| LR | good accuracy and fast development time; based on the linear assumption | | | Schwarzer |
| NB | easy to understand, with an efficient training algorithm; assumes attributes are statistically independent | | | Friedman |
| BNs | an algorithm based on Bayes' theorem that represents conditional dependencies of variables via a directed acyclic graph | maximum likelihood parameter estimation was used for the parameter learning | the structure learning method was chosen among three algorithms: K2, Markov chain Monte Carlo, and tree-augmented Naïve Bayes | Jayasurya |
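As an illustration of how such parameters are optimized, the two RBF-kernel SVM parameters named in the table, cost (c) and gamma (g), are typically tuned by cross-validated grid search. A hedged sketch on synthetic data (the grid values and dataset are illustrative, not taken from the paper):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic binary outcome standing in for cause-specific death.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Grid-search the two RBF-kernel parameters named in the table:
# C (cost, controls overfitting) and gamma (degree of nonlinearity).
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
    scoring="roc_auc",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

The AUC scorer uses the SVM's decision function directly, matching the AUC-based model comparison reported in the tables below.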
Figure 1. The optimization and prediction system.
SEER dataset prediction result A represents the 9×2 predictive results trained by nine data mining algorithms together with two variable selection methods and tested on the SEER testing dataset with all 20 variables. SEER prediction result B represents the 9×2 predictive results tested on the SEER testing dataset with the 14 variables supported by both the SEER and CMU-SO datasets. CMU-SO prediction result represents the 9×2 predictive results tested on the CMU-SO testing dataset with the 14 variables supported by both the SEER and CMU-SO datasets.
AUCa calculated by testing prediction models on SEERb.
| | with 20 variables | | | | with 14 variables | | | | |
| Model | AUC | 95% confidence interval | variable selectionc | P vs TNM | AUC | 95% confidence interval | variable selection | P vs TNM | P (20 vs 14 variables) |
| BP | 81.72% | 79.99%–83.84% | BSFSd | <0.001 | 82.06% | 80.14%–83.99% | GA | <0.001 | 0.866 |
| CART | 80.08% | 78.05%–82.10% | BSFS | 0.007 | 80.11% | 78.09%–82.13% | BSFS | 0.005 | 0.866 |
| SVM | 81.34% | 79.37%–83.32% | BSFS | <0.001 | 81.46% | 79.50%–83.42% | Global | <0.001 | 0.998 |
| ANFIS | 81.97% | 80.03%–83.89% | GAe | <0.001 | 82.06% | 80.15%–83.97% | GA | <0.001 | 0.811 |
| RBF | 81.67% | 79.73%–83.63% | GA | <0.001 | 82.06% | 79.92%–83.78% | Global | <0.001 | 0.620 |
| GRNN | 80.98% | 78.99%–82.96% | GA | <0.001 | 80.83% | 78.84%–82.81% | BSFS | <0.001 | 0.625 |
| LR | 81.44% | 79.47%–83.38% | Globalf | <0.001 | 81.42% | 79.46%–83.37% | Global | <0.001 | 0.877 |
| NB | 80.10% | 78.07%–82.09% | GA | 0.003 | 79.99% | 77.96%–82.03% | BSFS | <0.001 | 0.766 |
| BNs | 80.18% | 78.16%–82.20% | GA | <0.001 | 80.25% | 78.23%–82.26% | GA | <0.001 | 0.428 |
| TNM | 78.40% | 76.28%–80.51% | 78.40% | 76.28%–80.51% | |||||
AUCa: area under the receiver operating characteristic curves.
SEERb: National Cancer Institute's Surveillance, Epidemiology, and End Results.
variable selectionc: the variable selection method which has the highest AUC.
BSFSd: variable selection using backward stepwise feature selection.
GAe: variable selection using genetic algorithms.
Globalf: without variable selection.
AUC values are the median AUC of 15 tests.
P vs TNM: comparing the AUC of the prediction model with the TNM staging system.
P (20 vs 14 variables): comparing the AUC of prediction models with 20 variables to that with 14 variables.
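The AUCs and 95% confidence intervals above can be reproduced in outline by scoring a held-out test set and bootstrapping it. A minimal sketch on synthetic data (the model choice and sample sizes are illustrative; the paper reports the median AUC of 15 tests, which this sketch does not replicate):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, scores)

# Bootstrap the test set to get a percentile 95% confidence interval.
boot = []
for _ in range(1000):
    idx = rng.integers(0, len(y_te), len(y_te))
    if len(np.unique(y_te[idx])) < 2:  # a resample needs both classes
        continue
    boot.append(roc_auc_score(y_te[idx], scores[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"AUC {auc:.3f} (95% CI {lo:.3f}-{hi:.3f})")
```

Bootstrapping only the test set mirrors the paper's design of evaluating fixed trained models on an independent testing dataset.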
Figure 2. The ROC curves from two different testing datasets.
A. Comparison of the predictive accuracy of three prognostic models: ANFIS together with GA, NB together with BSFS, and the AJCC 7th TNM staging system, using the SEER testing dataset with 14 variables as the testing dataset. B. Comparison of the predictive accuracy of three prognostic models: LR together with BSFS, ANFIS together with GA, and the AJCC 7th TNM staging system, using the CMU-SO testing dataset as the testing dataset.
AUCa calculated by testing prediction models on CMU-SOb.
| Model | AUC | 95% confidence interval | variable selectionc | P vs TNM |
| BP | 78.15% | 75.01%–81.10% | GAd | 0.074 |
| CART | 77.29% | 73.73%–80.84% | BSFSe | 0.209 |
| SVM | 77.95% | 74.47%–81.44% | Globalf | 0.115 |
| ANFIS | 77.20% | 73.64%–80.76% | GA | 0.410 |
| RBF | 77.22% | 73.67%–80.76% | GA | 0.386 |
| GRNN | 78.24% | 74.77%–81.70% | GA | 0.004 |
| LR | 78.24% | 74.82%–81.67% | BSFS | 0.044 |
| NB | 78.19% | 74.69%–81.69% | BSFS | 0.005 |
| BNs | 77.90% | 74.41%–81.39% | Global | 0.013 |
| TNM | 75.93% | 72.29%–79.57% |
AUCa: area under the receiver operating characteristic curves.
CMU-SOb: a dataset of clinical information collected by the Department of Surgical Oncology at the First Hospital of China Medical University.
variable selectionc: the variable selection method which has the highest AUC.
GAd: variable selection using genetic algorithms.
BSFSe: variable selection using backward stepwise feature selection.
Globalf: without variable selection.
AUC values are the median AUC of 15 tests.
P vs TNM: comparing the AUC of the prediction model with the TNM staging system.
Figure 3. Survival rates at eight risk levels.
Comparison of the predicted survival rates with the real-world survival rates at eight different risk levels. The predicted survival rate is based on a predictive model built by LR together with BSFS.
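Grouping predicted risks into eight equal-frequency levels and comparing predicted against observed event rates, as in Figure 3, can be sketched as follows (synthetic data and a plain logistic model stand in for the paper's LR-with-BSFS pipeline):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)
p = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# Cut the predicted risk into eight equal-frequency levels (quantile bins)
# and compare the mean predicted vs observed event rate in each level.
edges = np.quantile(p, np.linspace(0, 1, 9)[1:-1])  # 7 inner cut points
levels = np.digitize(p, edges)                      # level 0 .. level 7
for k in range(8):
    mask = levels == k
    print(k, round(p[mask].mean(), 3), round(y_te[mask].mean(), 3))
```

A well-calibrated model shows predicted and observed rates tracking each other across all eight levels, which is the comparison the figure makes for survival rates.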