Zsófia Sztupinszki, Balázs Győrffy.
Abstract
Multiple gene-expression-based subtypes have been proposed for the molecular subdivision of colon cancer in the last decade. We aimed to cross-validate these classifiers to explore their concordance and their power to predict survival. A gene-chip-based database comprising 2,166 samples from 12 independent datasets was set up. A total of 22 different molecular subtypes were re-trained including the CCHS, CIN25, CMS, ColoGuideEx, ColoGuidePro, CRCassigner, MDA114, Meta163, ODXcolon, Oncodefender, TCA19, and V7RHS classifiers as well as subtypes established by Budinska, Chang, DeSousa, Marisa, Merlos, Popovici, Schetter, Yuen, and Watanabe (first authors). Correlation with survival was assessed by Cox proportional hazards regression for each classifier using relapse-free survival data. The highest efficacy at predicting survival in stage 2-3 patients was achieved by Yuen (p = 3.9e-05, HR = 2.9), Marisa (p = 2.6e-05, HR = 2.6) and Chang (p = 9e-09, HR = 2.35). Finally, 61 colon cancer cell lines from four independent studies were assigned to the closest molecular subtype.
Year: 2016 PMID: 27849044 PMCID: PMC5111107 DOI: 10.1038/srep37169
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Summary of pooled database setup and classifier selection.
A flowchart depicting dataset identification starting with the combination of “colon cancer” and the three different platforms in GEO (A). Composition of the entire database – the 12 datasets included and basic clinical characteristics – sample numbers are given for TNM because these data were available for only a fraction of the patients (B). Correlation between survival, TNM and stage in the entire database (C). Identification of classifiers through a PubMed search (D).
Summary of the implemented classifiers.
| Name | FF/FP | Stage | Cohorts | Classification method | Genes (n) | Platform | Training GSE datasets | Training size (n) | Validation GSE datasets | Validation size (n) |
|---|---|---|---|---|---|---|---|---|---|---|
| Budinska | FF, FP | II–III | 5 | linear discriminant analysis | 658 | ALMAC, U133 + 2, U133A | 14333, 2109, 17537, PETACC3 | 425 | 4107, 4183, 10714, 15960, 13294, 18088, 26682, 26906, TCGA | 720 |
| CCHS | FF, FP | II–III | 2 | sum score, pre-computed cutoff | 6 | U133 + 2 | 13294, 5206, 17537, 17536 | 229 | FFPE samples | 126 |
| Chang95 | FF | I–III | 3 | median of mean expression | 95 | ALMAC, U133 + 2, U133A, HuEx-1_0-st | 28702, 5206 | 188 | 17536, 17537, 14333, 37892, 12945, 41258, 24551, E-MTAB-863, E-MTAB-864 | 682 |
| CIN25 | FF | — | 2 | median of mean expression | 25 | U133 + 2, U133A, Rosetta | multiple (n = 18) | 1,944 | NA | NA |
| CMS | FF, FP | — | 4 | centroid-based predictor | 693 | Agilent, RNA-seq, Affymetrix | multiple (n = 18) | 1,721 | same as training (one dataset was split into equal training and validation parts) | 1,721 |
| ColoGuideEx | FF | II | 2 | number of genes exceeding the 80th and 20th percentile | 13 | HuEx-1_0-st | 24550, 29638, 30378 | 112 | 24550, 29638, 30378, 14333, 17538 | 203 |
| ColoGuidePro | FF | II–III | 2 | | 7 | HuEx-1_0-st | 30378 | 95 | 14333, 17538, 24550 | 290 |
| CRCassigner-786 | FF | — | 5 | PAM | 786 | U133 + 2 | 13294, 14333 | 445 | 13294, 14333, 12945, 16125, 20916, 20842, 21510, TCGA, 28722 | 744 |
| DeSousa | FF | — | 3 | PAM | 146 | U133 + 2 | 33114 | 90 | 14333, 17538, 13294, 13067, 5851, 28702, 35144, E-MTAB-991, TCGA | 1,074 |
| Marisa | FF | — | 6 | centroid-based predictor | 57 | U133 + 2 | 39582 | 443 | 13067, 13294, 14333, 17536/17537, 18088, 26682, 33113, TCGA | 1,181 |
| MDA114 | FF | II–III | 2 | compound covariate predictor | 114 | U133 + 2 | 17536 | 179 | 17537, 12945, 14333 | 213 |
| Merlos-EphB2-ISC | FF | — | 3 | average signature expression | 29 | Affymetrix mouse4302 | Mouse dataset 6894 | 18 | 17538, 14333 | 340 |
| Merlos-Lgr5-ISC | FF | — | 3 | average signature expression | 64 | Affymetrix mouse4302 | Mouse dataset 6894 | 18 | 17538, 14333 | 340 |
| Meta163 | FF | II–III | 2 | PAM | 128 | U133 + 2 | 5206, 14333 | 188 | 14333 | 99 |
| ODXcolon | FP | II–III | 3 | weighted score | 12 | RT-PCR based | — | 1,851 | — | 711 |
| Oncodefender | FP | I–II | 2 | multiplication of signature genes | 5 | RT-PCR based | — | 74 | — | 264 |
| Popovici | FP, FF | II–III | 2 | closest mean expression | 64 | ALMAC-array | PETACC-3 | 688 | 2138, 17538, ALMAC | 114 |
| Schetter | FF | — | 2 | weighted sum | 9 | RT-PCR based | — | 57 | — | 139 |
| TCA19 | FF | III | 2 | median | 19 | U133 + 2 | 39582 | 566 | 14333, 33113, 37892 | 449 |
| Yuen3 | FF | II–III | 4 | median per gene | 3 | U133 + 2 | 14333, 17538 | 458 | — | — |
| V7RHS | FP | II | 2 | weighted sum | 7 | RT-PCR based | — | 233 | — | — |
| Watanabe-CIN | FF | II–III | 2 | SVM | 112 | U133 + 2 | 30540 | 845 | 14333 | 290 |
Abbreviations: FP: FFPE; FF: fresh frozen; U133: Affymetrix HG-U133A; U133 + 2: Affymetrix HG-U133Plus2.0; ALMAC: Affymetrix Almac Xcel Array for FFPE; Rosetta: Rosetta custom 25 K array.
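Several of the listed classifiers (Chang95, CIN25) are described as using the "median of mean expression". The following is a minimal sketch of one plausible reading of that rule, not the authors' implementation: each sample is scored by the mean expression of the signature genes, and the cohort is split at the median score. The gene names, the expression values, and the "high"/"low" labels are hypothetical, and the record does not say which side corresponds to poor prognosis.

```python
from statistics import mean, median


def median_of_mean_split(expression, signature):
    """Split samples into 'high' and 'low' groups.

    expression: {sample_id: {gene: value}} (hypothetical input layout)
    signature:  iterable of gene names belonging to the classifier
    """
    signature = list(signature)
    # Score each sample by the mean expression of its signature genes.
    scores = {
        sample: mean(values[g] for g in signature if g in values)
        for sample, values in expression.items()
    }
    # Use the cohort-wide median of those scores as the cutoff.
    cutoff = median(scores.values())
    return {s: ("high" if sc > cutoff else "low") for s, sc in scores.items()}


# Invented toy data: two signature genes, four samples.
toy = {
    "s1": {"GENE_A": 1.0, "GENE_B": 1.0},
    "s2": {"GENE_A": 2.0, "GENE_B": 2.0},
    "s3": {"GENE_A": 3.0, "GENE_B": 3.0},
    "s4": {"GENE_A": 4.0, "GENE_B": 4.0},
}
groups = median_of_mean_split(toy, ["GENE_A", "GENE_B"])
```

Because the cutoff is the median of the cohort being classified, the same sample can fall on different sides of the split in different cohorts.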
Figure 2. Concordance of classifier output and gene composition.
Concordance across all classifiers in all samples (A) and, for the top eight classifiers in stage II/III patients, among samples identified as having a bad (red) or good (green) prognosis (B). In each case, a Cramér's V of 1 represents perfect concordance and 0 represents complete discordance. Percentage overlap of the gene lists included in the classifiers (C). An example for interpretation: Budinska contains 100% of the genes in the CIN25 signature, but these account for only 4% of all Budinska genes. V7RHS, Meta163 and Oncodefender are not included because they had less than 1% overlap with any other classifier.
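The two quantities in this caption can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' code: `cramers_v` uses the standard uncorrected formula, V = sqrt(chi² / (n · (min(rows, cols) − 1))), and `overlap_percent` treats each signature as a plain set of gene identifiers. The gene identifiers are placeholders; only the signature sizes (658 for Budinska, 25 for CIN25) come from the classifier summary table.

```python
import math
from collections import Counter


def cramers_v(labels_a, labels_b):
    """Cramér's V between two categorical label vectors of equal length."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    row, col = Counter(labels_a), Counter(labels_b)
    chi2 = 0.0
    for a in row:
        for b in col:
            expected = row[a] * col[b] / n
            chi2 += (joint[(a, b)] - expected) ** 2 / expected
    k = min(len(row), len(col)) - 1
    return math.sqrt(chi2 / (n * k)) if k > 0 else float("nan")


def overlap_percent(container, contained):
    """Percent of the genes in `contained` that also occur in `container`."""
    contained = set(contained)
    return 100.0 * len(set(container) & contained) / len(contained)


# Placeholder gene sets with the sizes reported for the two signatures.
budinska = {f"g{i}" for i in range(658)}
cin25 = {f"g{i}" for i in range(25)}

cin25_in_budinska = overlap_percent(budinska, cin25)   # share of CIN25 covered
budinska_in_cin25 = overlap_percent(cin25, budinska)   # share of Budinska covered
```

The overlap measure is asymmetric, which is why a small signature can be fully contained in a large one while covering only a small share of it: 25 of 658 genes is about 4%.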
Figure 3. Relative classifier performance in stage II/III patients.
A forest plot showing the performance of all classifiers in stage II and III patients (A). Samples included in the original training sets were excluded from the validation analysis (*with the exception of CMS). For classifiers with more than two outputs, the best- and worst-performing cohorts were compared when computing the hazard ratio. Kaplan-Meier plots for the best-performing classifiers, including Yuen3 (B), Chang95 (C), Marisa (D) and ODXcolon (E). The higher the percentage of classifier genes that are significant in univariate analysis, the higher the hazard ratio achieved by the classifier (*all patients for CMS) (F). The dotted lines represent the proportion of genes significant at p = 0.05 and p = 0.01 among all genes on the gene chips in univariate analysis.
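The hazard ratio between two cohorts (for example the best- and worst-performing subtype of one classifier) comes from a Cox proportional hazards model with a single binary covariate. The sketch below fits that model by Newton-Raphson on the Breslow partial likelihood. It is a didactic stand-in and makes no claim about the software the authors used; the follow-up times, event flags, and group labels are invented.

```python
import math


def cox_binary_hr(times, events, group, iters=25):
    """Hazard ratio of group 1 vs. group 0 from a one-covariate Cox model.

    times:  follow-up time per patient
    events: 1 if relapse was observed, 0 if censored
    group:  1 for the assumed poor-prognosis cohort, 0 for the other
    """
    beta = 0.0
    n = len(times)
    for _ in range(iters):
        score, info = 0.0, 0.0
        for i in range(n):
            if not events[i]:
                continue
            # Risk set: every patient still under observation at this event time.
            s0 = s1 = 0.0
            for j in range(n):
                if times[j] >= times[i]:
                    w = math.exp(beta * group[j])
                    s0 += w
                    s1 += w * group[j]
            m = s1 / s0                  # weighted share of group 1 in the risk set
            score += group[i] - m        # first derivative of the log-likelihood
            info += m * (1.0 - m)        # negative second derivative (x is 0 or 1)
        if info == 0.0:
            break
        step = score / info
        beta += step
        if abs(step) < 1e-10:
            break
    return math.exp(beta)


# Invented toy cohort: group 1 tends to relapse earlier.
times = [2, 3, 4, 5, 6, 7, 8, 9, 10, 12]
events = [1, 1, 1, 0, 1, 1, 0, 1, 0, 0]
group = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]

hr = cox_binary_hr(times, events, group)
hr_flipped = cox_binary_hr(times, events, [1 - g for g in group])
```

Swapping the group labels inverts the hazard ratio, which is a quick sanity check on any such fit. The estimate does not exist when one cohort has no events while the other is at risk, so real analyses should rely on an established survival library.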
Assignment of preclinical models to the closest subtype.
Assignment of the most frequently utilized cell lines to each of the molecular classifiers based on a gene-array-based expression profile of the cell line. If multiple arrays were available for a given cell line and more than 40% of the arrays delivered different results, the cell line was not classified by that classifier. Mutation status is shown for the six most important genes, and MSI status is also shown for each cell line. Abbreviations: M: mutant, WT: wild type, H: high, L: low, int: intermediate, Bm: BRAF mutant, MSH: MSI-High, MSS: MSI Stable, Sample n: number of gene chip samples providing expression data for the classification.
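The 40% rule for cell lines profiled on several arrays can be sketched as follows. This assumes one reading of the caption: the most frequent call across arrays is taken as the cell line's subtype, and the cell line is left unclassified when more than 40% of its arrays disagree with that call. The function name and the subtype labels are hypothetical.

```python
from collections import Counter


def assign_cell_line(array_calls, max_discordant=0.40):
    """Return the consensus subtype for one cell line, or None if unclassified.

    array_calls: subtype call from each array available for the cell line
    """
    if not array_calls:
        return None
    subtype, count = Counter(array_calls).most_common(1)[0]
    discordant = 1.0 - count / len(array_calls)
    # Too many arrays disagree with the most frequent call: leave unclassified.
    if discordant > max_discordant:
        return None
    return subtype


consistent = assign_cell_line(["A", "A", "A", "A", "B"])   # 20% discordant
conflicting = assign_cell_line(["A", "A", "B", "B", "C"])  # 60% discordant
```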
Figure 4. Linking prevalence and preclinical models.
Subtype designation proportion for each classifier in all patients (left column, in percentage), and number of cell lines available for the given subtype (right column, n) for the best-performing classifiers. Grey corresponds to NA in each classifier. There are no representative cell line models for some of the subtypes.