Wenyuan Zhao, Beibei Chen, Xin Guo, Ruiping Wang, Zhiqiang Chang, Yu Dong, Kai Song, Wen Wang, Lishuang Qi, Yunyan Gu, Chenguang Wang, Da Yang, Zheng Guo.
Abstract
The irreproducibility problem seriously hinders studies on transcriptional signatures for predicting the relapse risk of early-stage colorectal cancer (CRC) patients. By reviewing 34 recently published studies that developed CRC prognostic signatures from gene expression profiles, we found that, surprisingly, 33 of them analyzed CRC samples with and without adjuvant chemotherapy together in the training and/or validation datasets. This data misuse can be partially attributed to unclear and incomplete data annotation in public data sources. Furthermore, all the signatures proposed by these studies were based on risk scores summarized from gene expression levels, which are sensitive to experimental batch effects and to the risk composition of the samples analyzed together. To avoid these problems, we carefully selected three qualified large datasets to develop and validate a signature consisting of three pairs of genes. The within-sample relative expression orderings of these gene pairs could robustly predict the relapse risk of stage II CRC samples assessed in different laboratories. The transcriptional and functional analyses provided clear evidence that the high-risk patients predicted by the proposed signature represent patients with micro-metastases.
Keywords: experimental batch effect; gene expression profiles; gene pairs; prognostic signatures; relative expression
Year: 2016 PMID: 26967049 PMCID: PMC4951352 DOI: 10.18632/oncotarget.7956
Source DB: PubMed Journal: Oncotarget ISSN: 1949-2553
Figure 1. The flowchart for the development of the rank-based prognostic GPS
Proposed gene expression signatures for prognostic assessment of CRC
| Date | Datasets | Mixed CTX/non-CTX samples | Tumor stage | Prognostic endpoint | References (PubMed index) |
|---|---|---|---|---|---|
| 2015 | GSE12945, GSE41258, GSE14333, GSE17538, GSE29623, GSE33113, GSE39582, GSE24549, GSE24550, GSE30378, GSE28722 | yes | I-IV | OS | 25853550 |
| 2015 | GSE17536 | yes | I-IV | DSS | 25622900 |
| 2015 | GSE24549, GSE24550, GSE39582 | yes | I-IV | OS | 25894381 |
| 2014 | GSE13294, GSE5206, GSE17536, GSE17537 | yes | II-III | DFS | 24486594 |
| 2014 | GSE14333, GSE17538, GSE33113, GSE31595, GSE14095, GSE26892 | yes | II-III | Relapse | 25115384 |
| 2014 | GSE14333, GSE33113, GSE17538 | yes | I-III | Relapse | 24829396 |
| 2014 | GSE17536, GSE17537, GSE38832 | yes | II-III | OS, DSS and DFS | 25320007 |
| 2014 | GSE17536, GSE30378 | yes | I-IV | DSS | 25000257 |
| 2014 | GSE17538, GSE14333 | yes | II-III | RFS | 24728738 |
| 2014 | GSE17538, GSE14333 | yes | II-III | RFS, OS | 25504183 |
| 2014 | GSE39582, GSE14333, GSE17536 | yes | I-IV | DFS | 24809982 |
| 2013 | GSE14333, GSE17538 | yes | I-IV | RFS | 23372686 |
| 2013 | GSE14333, GSE17538, GSE12032 | yes | I-IV | DFS, DSS | 23658834 |
| 2013 | GSE17536 | yes | I-IV | OS | 24247253 |
| 2013 | GSE17536 | yes | I-IV | DSS | 23799978 |
| 2013 | GSE17536, GSE14333 | yes | I-III | DFS | 22859720 |
| 2013 | GSE17536, GSE14333 | yes | I-IV | OS | 23807160 |
| 2013 | GSE17536, GSE14333, GSE12945 | yes | I-IV | OS | 24140838 |
| 2013 | GSE17536, GSE17537 | yes | I-III | OS | 23922772 |
| 2013 | GSE17536, the training data was not provided | yes | II-III | OS | 24170546 |
| 2013 | GSE17537 | yes | I-IV | RFS | 24360964 |
| 2013 | GSE17538 | yes | I-IV | OS | 24052018 |
| 2013 | GSE17538, GSE14333, GSE37892 | yes | I-IV | DFS | 23626670 |
| 2012 | GSE12032, GSE17538, GSE17181, GSE4526 | yes | II-III | Relapse | 22348113 |
| 2012 | GSE14333, GSE17538, GSE30378, GSE24550 | yes | II-III | RFS | 22991413 |
| 2012 | GSE17536, GSE14333 | yes | I-IV | RFS | 22710688 |
| 2012 | GSE17536, GSE14333 | yes | I-III | DFS | 22859720 |
| 2012 | GSE17536, GSE14333 | yes | II-III | RFS | 22844451 |
| 2012 | GSE17537, GSE14333 | yes | I-IV | RFS | 23153532 |
| 2012 | GSE29638, GSE24550, GSE30378 | no | II | RFS | 22213796 |
| 2011 | GSE5206, GSE10402 | yes | I-IV | RFS | 21098318 |
| 2011 | GSE5206, GSE17537 | yes | I-IV | OS | 22977525 |
| 2010 | GSE17538 | yes | II-III | OS, RFS | 19914252 |
| 2010 | GSE17538, GSE14333 | yes | II-III | DFS, DSS | 21119668 |
Abbreviations: RFS, relapse-free survival (also called DFS, disease-free survival); OS, overall survival; DSS, disease-specific survival.
Figure 2. The Kaplan-Meier curves of RFS for samples in six datasets
The CRC datasets used in this work generated on GPL570 platform
| Dataset | Stage I CRC# | Stage II CRC# without CTX | Stage II CRC# with CTX | Stage III CRC# | Stage IV CRC# |
|---|---|---|---|---|---|
| GSE39582 | 33 | 203 | 56 | 205 | 60 |
| GSE14333 | 44 | 72 | 22 | 91 | 61 |
| GSE17536 | 24 | 55 | 0 | 57 | 39 |
Figure 3. The Kaplan-Meier curves of RFS for stage II CRC samples stratified by the 3-GPS in the training and validation datasets
A. The training dataset GSE39582; B. The independent validation dataset GSE14333; C. The validation dataset GSE17536. A sample was classified into the high-risk group (red line) only if at least two gene pairs in the 3-GPS voted for high risk.
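The voting rule in the caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the gene names are hypothetical placeholders (the genes in the 3-GPS are not listed in this section), and the convention that a pair votes high-risk when its first gene is expressed above its second is an assumption.

```python
from typing import Mapping, Sequence, Tuple

# Hypothetical placeholder names: the genes in the actual 3-GPS are not listed here.
GENE_PAIRS: Sequence[Tuple[str, str]] = (
    ("GENE_A", "GENE_B"),
    ("GENE_C", "GENE_D"),
    ("GENE_E", "GENE_F"),
)


def high_risk_votes(expr: Mapping[str, float],
                    pairs: Sequence[Tuple[str, str]] = GENE_PAIRS) -> int:
    """Count pairs whose within-sample ordering votes for high risk.

    Assumption: a pair (a, b) votes high-risk when expr[a] > expr[b].
    """
    return sum(1 for a, b in pairs if expr[a] > expr[b])


def classify(expr: Mapping[str, float],
             pairs: Sequence[Tuple[str, str]] = GENE_PAIRS,
             min_votes: int = 2) -> str:
    """Majority rule: high risk only if at least `min_votes` pairs vote for it."""
    return "high" if high_risk_votes(expr, pairs) >= min_votes else "low"


# Made-up expression values for one sample.
sample = {"GENE_A": 8.1, "GENE_B": 6.3, "GENE_C": 5.0,
          "GENE_D": 7.2, "GENE_E": 9.4, "GENE_F": 9.0}
# A monotone rescaling (a crude stand-in for a batch effect) keeps the orderings.
shifted = {gene: 2.0 * value + 3.0 for gene, value in sample.items()}
```

Because only the within-sample ordering of each pair is used, `classify(sample)` and `classify(shifted)` return the same label, which is the property that makes a rank-based signature insensitive to this kind of batch effect.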
Multivariate Cox regression analyses of the 3-GPS
| Clinical Characteristic | HR | Cox p-value | 95% CI |
|---|---|---|---|
| GPS (High Risk) | 7.5479 | 7.28 × 10⁻⁶ | [3.121, 18.257] |
| Age | 2.7269 | 0.0423 | [1.034, 7.182] |
| Sex (Male) | 0.9944 | 0.7477 | [0.961, 1.029] |
| Localization (distal vs proximal) | 1.9770 | 0.1794 | [0.731, 5.348] |
| MSI | 0.6845 | 0.6922 | [0.105, 4.471] |
| Braf mut | 3.8353 | 0.2086 | [0.472, 31.178] |
| Kras mut | 0.8843 | 0.8322 | [0.284, 2.757] |
| Tp53 mut | 1.2155 | 0.6301 | [0.549, 2.690] |
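The hazard ratios, confidence intervals, and p-values in this table can be cross-checked against each other. The sketch below assumes each interval is a standard Wald interval, exp(log(HR) ± 1.96 × SE), which is the usual Cox regression output but is not stated in this section; the function name is illustrative.

```python
import math


def wald_p_from_ci(hr: float, ci_low: float, ci_high: float,
                   z_crit: float = 1.96) -> float:
    """Two-sided Wald p-value recovered from a hazard ratio and its 95% CI.

    Assumes the interval is exp(log(HR) +/- z_crit * SE), i.e. symmetric
    around log(HR) on the log scale.
    """
    se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * z_crit)
    z = math.log(hr) / se
    # erfc(|z| / sqrt(2)) equals the two-sided standard normal tail probability.
    return math.erfc(abs(z) / math.sqrt(2.0))


# Values taken from the GPS and Age rows of the table above.
p_gps = wald_p_from_ci(7.5479, 3.121, 18.257)
p_age = wald_p_from_ci(2.7269, 1.034, 7.182)
```

Under this assumption, `p_gps` and `p_age` agree with the tabulated p-values to about two significant figures, which suggests the three columns of those rows are internally consistent.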
The consistency of the Risk-DE genes detected from three datasets
| Dataset 1 | Dataset 2 | DE genes 1 | DE genes 2 | Overlap | Consistency |
|---|---|---|---|---|---|
| GSE39582 | GSE14333 | 3599 | 2540 | 836 | 98.09% |
| GSE39582 | GSE17536 | 3599 | 505 | 364 | 99.45% |
| GSE14333 | GSE17536 | 2540 | 505 | 247 | 100% |
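A minimal sketch of how the overlap and consistency columns could be computed, assuming that "consistency" means the share of overlapping DE genes dysregulated in the same direction in both datasets (this definition is not spelled out in this section). The gene names and directions below are made up for illustration.

```python
from typing import Dict, Tuple


def overlap_consistency(de1: Dict[str, int], de2: Dict[str, int]) -> Tuple[int, float]:
    """Return (number of shared DE genes, share of them with the same direction).

    Each dict maps a gene to +1 (up-regulated) or -1 (down-regulated).
    """
    shared = de1.keys() & de2.keys()
    if not shared:
        return 0, float("nan")
    same = sum(1 for gene in shared if de1[gene] == de2[gene])
    return len(shared), same / len(shared)


# Toy, made-up gene lists; not data from the study.
toy_a = {"G1": 1, "G2": -1, "G3": 1, "G4": -1}
toy_b = {"G2": -1, "G3": 1, "G4": 1, "G5": -1}
n_shared, consistency = overlap_consistency(toy_a, toy_b)
```

In the toy example three genes are shared and two of them agree in direction. On the scale of the table, 820 concordant genes out of 836 would correspond to the reported 98.09%.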
The consistency between the Risk-DE genes and the Metastatic-DE genes
| Dataset | Risk-DE genes# | Metastatic-DE genes# | Overlap | p1 | Consistency | p2 |
|---|---|---|---|---|---|---|
| GSE39582 | 3599 | 174 | 41 | 0.0307 | 100% | 4.55 × 10⁻¹³ |
| GSE14333 | 2540 | 278 | 118 | <2.2 × 10⁻¹⁶ | 100% | <2.2 × 10⁻¹⁶ |
| GSE17536 | 505 | 45 | 12 | 6.79 × 10⁻¹⁰ | 100% | 2.4 × 10⁻⁴ |
Notes: #, the number of Risk-DE genes and Metastatic-DE genes; p1, the p value of the overlap between Risk-DE genes and Metastatic-DE genes; p2, the p value of the concordance score of the overlapped DE genes.
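One reading of p2 that reproduces the tabulated values is a one-sided binomial test in which each overlapping gene has a 0.5 chance of agreeing in direction by chance. This is an assumption, since the test is not named in this section. A minimal sketch using only the Python standard library:

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1.0 - p)**(n - i) for i in range(k, n + 1))


# All overlapping genes are reported as concordant (100%), so k equals n.
p2_gse39582 = binomial_upper_tail(41, 41)
p2_gse14333 = binomial_upper_tail(118, 118)
p2_gse17536 = binomial_upper_tail(12, 12)
```

With full concordance the tail reduces to 0.5 raised to the overlap size: 0.5⁴¹ ≈ 4.55 × 10⁻¹³ and 0.5¹² ≈ 2.4 × 10⁻⁴, matching the p2 entries for GSE39582 and GSE17536, while 0.5¹¹⁸ falls below the 2.2 × 10⁻¹⁶ reporting floor.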
The KEGG function enrichment analysis results
| Pathway name | Adjusted p-value | References (PubMed index) |
|---|---|---|
| ECM-receptor interaction | 2.22 × 10⁻¹⁴ | 9854310 |
| Focal adhesion | 7.99 × 10⁻¹⁰ | 15246682 |
| Protein digestion and absorption | 4.97 × 10⁻⁵ | 21490305 |
| PI3K-Akt signaling pathway | 2.98 × 10⁻³ | 7558426 |
| Glycosaminoglycan biosynthesis-chondroitin sulfate/dermatan sulfate | 5.29 × 10⁻³ | 24035453 |
| Regulation of actin cytoskeleton | 4.38 × 10⁻² | 11709869 |
The distribution of stage II CRC predicted by the 3-GPS
| Datasets | L-risk CRC with CTX | L-risk CRC without CTX | H-risk CRC with CTX | H-risk CRC without CTX |
|---|---|---|---|---|
| GSE39582 | 46 | 176 | 10 | 27 |
| GSE14333 | 11 | 52 | 11 | 20 |
| All | 57 | 228 | 21 | 47 |
Abbreviations: CTX, patients with completely resected tumors who received adjuvant chemotherapy; L-risk and H-risk, patients with low and high relapse risk, respectively.
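The counts in this table can be checked with simple arithmetic. The sketch below re-derives the "All" row from the two dataset rows and computes the share of stage II patients predicted as high-risk; the variable names are illustrative only.

```python
# Stage II counts per dataset, in the column order of the table above:
# (L-risk with CTX, L-risk without CTX, H-risk with CTX, H-risk without CTX)
counts = {
    "GSE39582": (46, 176, 10, 27),
    "GSE14333": (11, 52, 11, 20),
}

# Column-wise sums across the two datasets.
totals = tuple(sum(column) for column in zip(*counts.values()))
n_stage2 = sum(totals)

# Share of all stage II patients predicted as high-risk.
high_risk_share = (totals[2] + totals[3]) / n_stage2
# Share of patients who received CTX within each predicted risk group.
ctx_share_low = totals[0] / (totals[0] + totals[1])
ctx_share_high = totals[2] / (totals[2] + totals[3])
```

The column sums reproduce the "All" row (57, 228, 21, 47), and 68 of the 353 stage II patients, about 19%, fall in the predicted high-risk group.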
Figure 4. Kaplan-Meier estimates of RFS for CTX and non-CTX patients in GSE39582 and GSE14333
A. Kaplan-Meier curves for stage II CRC patients in the low relapse risk group. B. Kaplan-Meier curves for stage II CRC patients in the high relapse risk group.