Hsiu-Ling Chou, Chung-Tay Yao, Sui-Lun Su, Chia-Yi Lee, Kuang-Yu Hu, Harn-Jing Terng, Yun-Wen Shih, Yu-Tien Chang, Yu-Fen Lu, Chi-Wen Chang, Mark L Wahlqvist, Thomas Wetter, Chi-Ming Chu.
Abstract
BACKGROUND: Microarray technology can acquire information about thousands of genes simultaneously. We analyzed published breast cancer microarray datasets to predict five-year recurrence and compared the performance of three data mining algorithms, artificial neural networks (ANN), decision trees (DT) and logistic regression (LR), and two composite models, DT-ANN and DT-LR. Four breast cancer microarray datasets from the Gene Expression Omnibus were pooled to predict five-year breast cancer relapse. After data compilation, the pooled data comprised 757 subjects, 5 clinical variables and 13,452 genetic variables. The bootstrap method, the Mann-Whitney U test and 20-fold cross-validation were used to select the 100 candidate genes with the most significant p-values. The predictive power of the DT, LR and ANN models was assessed using accuracy and the area under the ROC curve. The associated genes were evaluated using Cox regression.
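The gene-selection step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it ranks genes by a two-sided Mann-Whitney U p-value (normal approximation, no tie correction) between relapse and non-relapse samples and keeps the top `top_n`. The function names, the toy expression values and the gene labels are all hypothetical, and the bootstrap and cross-validation loops are omitted.

```python
import math
from typing import Dict, List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """U statistic of x versus y and a two-sided p-value (normal approximation)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2.0
    mean = n1 * n2 / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u1 - mean) / sd
    return u1, math.erfc(abs(z) / math.sqrt(2.0))


def rank_genes(expression: Dict[str, List[float]], relapse: List[int],
               top_n: int) -> List[Tuple[str, float]]:
    """Return the top_n (gene, p-value) pairs with the smallest p-values."""
    scored = []
    for gene, values in expression.items():
        pos = [v for v, r in zip(values, relapse) if r == 1]
        neg = [v for v, r in zip(values, relapse) if r == 0]
        scored.append((gene, mann_whitney(pos, neg)[1]))
    return sorted(scored, key=lambda item: item[1])[:top_n]


# Hypothetical toy data: 8 samples, 3 genes (the study used 757 and 13,452).
expression = {
    "gene_A": [5.1, 5.3, 5.0, 5.2, 7.9, 8.1, 8.0, 7.8],
    "gene_B": [6.0, 6.2, 5.9, 6.1, 6.1, 6.0, 6.2, 5.9],
    "gene_C": [4.0, 4.5, 4.2, 4.1, 5.0, 4.3, 5.2, 4.9],
}
relapse = [0, 0, 0, 0, 1, 1, 1, 1]
top_genes = rank_genes(expression, relapse, top_n=2)

# For a single gene used as a score, U / (n1 * n2) equals the area under the ROC curve.
u_a, p_a = mann_whitney(expression["gene_A"][4:], expression["gene_A"][:4])
auc_a = u_a / (4 * 4)
```

In the study the same ranking would be repeated inside each resampling round, so that genes are selected only from training data.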
Year: 2013 PMID: 23506640 PMCID: PMC3614553 DOI: 10.1186/1471-2105-14-100
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Flow chart showing the protocol used for the search and download of breast cancer microarray datasets from the GEO database.
Breast cancer microarray datasets
| GEO accession | Year | Authors | Title | Platform | Samples |
|---|---|---|---|---|---|
| GSE7390 | 2007 | Desmedt et al. | Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the TRANSBIG multicenter independent validation series. | HG-U133A | 198 |
| GSE2990 | 2006 | Sotiriou et al. | Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis. | HG-U133A | 189 |
| GSE4922 | 2006 | Ivshina et al. | Genetic reclassification of histologic grade delineates new clinical subtypes of breast cancer. | HG-U133A | 249 |
| GSE2034 | 2005 | Wang et al. | Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. | HG-U133A | 286 |
Abbreviations: GEO, Gene Expression Omnibus; GSE, GEO series accession number prefix; HG-U133A, an oligonucleotide GeneChip array from Affymetrix.
Clinical variables of each dataset
| Clinical data | Lymph node status | Lymph node status | Lymph node status | Lymph node status | Lymph node status |
| | Estrogen receptor | Estrogen receptor | Estrogen receptor | Estrogen receptor | Estrogen receptor |
| | | Age | Age | Age | Age |
| | | Tumor size | Tumor size | Tumor size | Tumor size |
| | | Histopathologic grade | Histopathologic grade | Histopathologic grade | Histopathologic grade |
| Treatment | Estrogen therapy | Estrogen therapy | Tamoxifen therapy | Surgical therapy | Surgical therapy |
| | Adjuvant therapy | Adjuvant therapy | Surgical therapy | | Tamoxifen therapy |
| | Surgical therapy | Surgical therapy | | | Estrogen therapy |
| | | | | | Adjuvant therapy |
| Survival | Distant metastasis events | Relapse events a | Relapse events b | Relapse events b | Relapse events a |
| | Distant metastasis time | Relapse time | Relapse time | Relapse time | Relapse time |
| | | | Distant metastasis events | Survival events | |
| | | | Distant metastasis time | Survival time | |
| | | | | Distant metastasis events | |
| | | | | Distant metastasis time | |
a Data includes death by breast cancer or any form of breast cancer recurrence (including local lymphatic drainage and distant metastases).
b Data includes any form of breast cancer recurrence (including local lymphatic drainage and distant metastases).
Figure 2. Flow chart of the protocol used for study subject selection.
Figure 3. Diagram of the methods used to identify predictive genes and establish prediction models.
Assessment of the predictive power of each single model using the 100-gene profile
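| Model | Training ACC (%) | Testing ACC (%) | Difference | Training AUC (%) | Testing AUC (%) | Difference |
|---|---|---|---|---|---|---|
| DT | 93.63 | 63.45 | 30.18 | 94.02 | 56.90 | 37.13 |
| LR | 82.53 | 64.12 | 18.40 | 87.68 | 58.96 | 28.72 |
| ANN80 | 73.42 | 70.93 | 4.09 | 72.11 | 64.09 | 8.02 |
| ANN100 | 84.63 | 69.54 | 15.09 | 84.98 | 63.88 | 21.09 |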
Abbreviations: ACC, ACCuracy; AUC, Area Under the Curve; DT, Decision Tree; LR, Logistic Regression; ANN80, Artificial Neural Network using 80% resampling set (over-training prevention); ANN100, ANN using 100% resampling set (without over-training prevention).
Assessment of the predictive power of the composite models using the 100-gene profile
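| Model | Training ACC (%) | Testing ACC (%) | Difference | Training AUC (%) | Testing AUC (%) | Difference |
|---|---|---|---|---|---|---|
| DL | 75.60 | 68.90 | 6.69 | 77.59 | 61.66 | 15.93 |
| DA80 | 72.69 | 69.30 | 3.39 | 71.92 | 64.20 | 7.72 |
| DA100 | 89.91 | 65.91 | 22.56 | 87.74 | 61.65 | 26.10 |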
Abbreviations: ACC, ACCuracy; AUC, Area Under the Curve; DL, Decision Tree combined with Logistic regression; DA80, Decision tree combined with Artificial neural network using 80% resampling set (over-training prevention); DA100, Decision tree combined with Artificial neural network using 100% resampling set (without over-training prevention).
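The gap between the two values reported for each metric gives a rough sense of over-fitting. The sketch below recomputes that gap from the figures in the two tables above, assuming (the rows themselves do not say so) that the first value of each pair is the training result and the second the testing result. `ROWS` and `train_test_gap` are illustrative names, and a recomputed gap need not match the printed difference columns exactly.

```python
from typing import Dict, List, Tuple

# (model, ACC first value, ACC second value, AUC first value, AUC second value),
# copied from the single-model and composite-model tables.
# Assumption: first value = training set, second value = testing set.
ROWS: List[Tuple[str, float, float, float, float]] = [
    ("DT", 93.63, 63.45, 94.02, 56.90),
    ("LR", 82.53, 64.12, 87.68, 58.96),
    ("ANN80", 73.42, 70.93, 72.11, 64.09),
    ("ANN100", 84.63, 69.54, 84.98, 63.88),
    ("DL", 75.60, 68.90, 77.59, 61.66),
    ("DA80", 72.69, 69.30, 71.92, 64.20),
    ("DA100", 89.91, 65.91, 87.74, 61.65),
]


def train_test_gap(rows: List[Tuple[str, float, float, float, float]]
                   ) -> Dict[str, Tuple[float, float]]:
    """Map each model to its (ACC gap, AUC gap), rounded to two decimals."""
    return {
        name: (round(acc_a - acc_b, 2), round(auc_a - auc_b, 2))
        for name, acc_a, acc_b, auc_a, auc_b in rows
    }


gaps = train_test_gap(ROWS)
for model, (acc_gap, auc_gap) in gaps.items():
    print(f"{model:7s} ACC gap {acc_gap:6.2f}  AUC gap {auc_gap:6.2f}")
```

A smaller gap suggests that a model's performance on the resampling set carries over better to unseen samples, which is the purpose of the 80% resampling variants.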
Figure 4. The AUC values of different gene numbers and the Cox regression of five-year recurrence rates of the test samples.
Figure 5. Kaplan-Meier analysis of the 21-gene expression profile.
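For readers unfamiliar with the Kaplan-Meier method named in the caption above, here is a minimal sketch of the product-limit estimator. It is not the authors' code; `kaplan_meier` is an illustrative name and the follow-up times and event flags are made-up toy values (1 = recurrence observed, 0 = censored).

```python
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float],
                 events: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (time, recurrence-free probability) at each distinct event time."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve: List[Tuple[float, float]] = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_removed = 0
        # Group all subjects sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_removed += 1
            i += 1
        if n_events > 0:
            survival *= 1.0 - n_events / n_at_risk
            curve.append((t, survival))
        # Both events and censored subjects leave the risk set after time t.
        n_at_risk -= n_removed
    return curve


# Hypothetical follow-up in years for six patients.
toy_times = [1, 2, 2, 3, 4, 5]
toy_events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(toy_times, toy_events)
```

Running the estimator separately on the groups defined by a gene profile, and comparing the resulting curves, is the usual way such a figure is built.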
Cox regression analysis of the five-year breast cancer recurrence of the test samples
| Variable | HR (95% CI) | P | HR (95% CI) | P | NRI |
|---|---|---|---|---|---|
| 21-gene profile | 3.53 (2.24-5.58) | <.001 | 2.60 (1.44-4.68) | .001 | 0.454 |
| Age | 0.98 (0.96-1.00) | .115 | 0.99 (0.977-1.02) | .896 | 0.012 |
| Tumor diameter | 1.54 (1.21-1.95) | <.001 | 1.41 (1.07-1.86) | .013 | 0.121 |
| Histopathologic grade a | 4.85 (1.94-12.09) | .001 | 3.59 (1.41-9.16) | .007 | 0.182 |
| Estrogen receptor b | 1.50 (1.02-2.20) | .035 | 1.03 (0.58-1.82) | .902 | 0.035 |
Abbreviations: HR, Hazard Ratio; CI, Confidence Interval; P, statistical P value; NRI, Net Reclassification Improvement. a Histopathologic grade: assessed by the Nottingham grading system. b Estrogen receptor: negative or positive.
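The net reclassification improvement listed in the abbreviations can be illustrated with the standard categorical definition: the net proportion of subjects with an event who move to a higher risk category, plus the net proportion of subjects without an event who move to a lower one. The sketch below assumes two risk categories (0 = low, 1 = high) and uses invented toy assignments; it does not reproduce how the values in the table were obtained.

```python
from typing import Sequence


def net_reclassification_improvement(old_risk: Sequence[int],
                                     new_risk: Sequence[int],
                                     outcome: Sequence[int]) -> float:
    """Categorical NRI of a new risk classification over an old one.

    outcome is 1 for subjects with the event (recurrence) and 0 otherwise.
    """
    n_event = sum(outcome)
    n_nonevent = len(outcome) - n_event
    if n_event == 0 or n_nonevent == 0:
        raise ValueError("need at least one event and one non-event")
    up_event = down_event = up_nonevent = down_nonevent = 0
    for old, new, y in zip(old_risk, new_risk, outcome):
        if new > old:
            if y == 1:
                up_event += 1
            else:
                up_nonevent += 1
        elif new < old:
            if y == 1:
                down_event += 1
            else:
                down_nonevent += 1
    event_part = (up_event - down_event) / n_event
    nonevent_part = (down_nonevent - up_nonevent) / n_nonevent
    return event_part + nonevent_part


# Hypothetical example: 4 patients with recurrence, 4 without.
old_risk = [0, 0, 1, 1, 0, 1, 0, 1]
new_risk = [1, 0, 1, 1, 0, 0, 0, 1]
outcome = [1, 1, 1, 1, 0, 0, 0, 0]
nri = net_reclassification_improvement(old_risk, new_risk, outcome)
```

A positive value means the new classification moves subjects in the right direction more often than in the wrong one.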
Figure 6. Breast cancer-related genes and DNA damage checkpoint regulation at the G2/M phase of the cell cycle. ANN: Artificial Neural Network; DA: Decision Tree combined with ANN; LR: Logistic Regression; DL: Decision Tree combined with LR; DT: Decision Tree.
Figure 7. Accuracy ratio between single and composite models.