Siyuan Ma, Shuji Ogino, Princy Parsana, Reiko Nishihara, Zhirong Qian, Jeanne Shen, Kosuke Mima, Yohei Masugi, Yin Cao, Jonathan A Nowak, Kaori Shima, Yujin Hoshida, Edward L Giovannucci, Manish K Gala, Andrew T Chan, Charles S Fuchs, Giovanni Parmigiani, Curtis Huttenhower, Levi Waldron.
Abstract
BACKGROUND: Previous approaches to defining subtypes of colorectal carcinoma (CRC) and other cancers based on transcriptomes have assumed the existence of discrete subtypes. We analyze gene expression patterns of colorectal tumors from a large number of patients to test this assumption and propose an approach to identify what may be a continuum of subtypes present across independent studies and cohorts.
Keywords: Colon cancer; Progression; Transcriptional profiling; Tumor
Year: 2018 PMID: 30253799 PMCID: PMC6154428 DOI: 10.1186/s13059-018-1511-4
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Clinical characteristics of selected training and validation sets used in this study
| Dataset | Accession ID | Platform | Tumor / Normal samples (n) | Late stage tumors (%) | Staging system | Availability of metastasis info |
|---|---|---|---|---|---|---|
| Training sets | | | | | | |
| Jorissen and Sieber, 2008b | GSE13294 | [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array | 155/0 | – | – | No |
| Watanabe and Hashimoto, 2008 | GSE14095 | [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array | 189/0 | – | – | No |
| Jorissen and Sieber, 2008 | GSE14333 | [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array | 290/0 | 77.55 | TNM/Duke | Yes |
| Smith and Beauchamp, 2009a | GSE17536 | [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array | 177/0 | 80 | TNM/Duke | Yes |
| Mori, Mimori, Yokobori T, 2010 | GSE21815 | Agilent-014850 Whole Human Genome Microarray 4x44K G4112F (Probe Name version) | 131/9 | 59.54 | TNM/Duke | Yes |
| Vilar and Morgan, 2011a | GSE26682.GPL570 | [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array | 176/0 | – | – | No |
| Vilar and Morgan, 2011b | GSE26682.GPL96 | [HG-U133A] Affymetrix Human Genome U133A Array | 155/0 | – | – | No |
| NHS-HPFS | GSE32651 | Illumina DASL HumanRef-8 v3 | 718/0 | 13.83 | TNM | No |
| Validation sets | | | | | | |
| Lips and Morreau, 2008 | GSE12225.GPL3676 | NKI-CMF | 42/0 | 28.57 | TNM | Yes |
| Staub and Rosenthal, 2009 | GSE12945 | [HG-U133A] Affymetrix Human Genome U133A Array | 62/0 | 41.94 | TNM | Yes |
| Jorissen and Sieber, 2008a | GSE13067 | [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array | 33/0 | – | – | No |
| Smith and Beauchamp, 2009b | GSE17538.GPL570 | [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array | 63/0 | 88.1 | TNM/Duke | Yes |
| expO, IGC, 2005 | GSE2109 | [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array | 427/0 | 51.6 | TNM/Duke | Yes |
| Tsukamoto and Sugihara, 2010 | GSE21510 | [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array | 123/25 | 79.57 | TNM/Duke | Yes |
| Medema and Tanis, 2011 | GSE33113 | [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array | 90/6 | – | TNM/Duke | Yes |
| Marisa and Boige, 2012 | GSE39582 | [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array | 566/0 | 87.75 | TNM/Duke | Yes |
| TCGAa | TCGA.COAD | Agilent 244 K Custom Gene Expression G4502A-07-3 | 122/4 | 42.4 | TNM | Yes |
| TCGAb | TCGA.RNASeqV2 | [RNASeqV2] Illumina HiSeq RNA sequencing | 181/14 | 53.09 | TNM | Yes |
The normal samples in these datasets were all from adjacent normal tissues. The percentages of late-stage and high-grade samples were calculated where the information was available.
Fig. 1 Overview of analyses performed in this study. Shown here are the steps carried out to examine the validity of discrete subtypes, as well as to identify, validate, and characterize continuously variable subtypes for CRC transcriptomes.
Fig. 2 Previously published CRC subtypes do not separate samples’ transcriptional profiles. Average silhouette widths between the previously reported CMS [10] subtypes provide no evidence for substantial clustering structure. Silhouette widths for separation between CMS subtypes are calculated within each of the 18 datasets used in our study, either for all samples or only for those confidently labeled by the CMS classifier as provided in [10]. Distribution of samples’ silhouette widths is represented with box plots, with diamonds marking the average. Studies are separated according to training and validation sets, and then ranked based on their average silhouette widths. Datasets also used in the CMS paper are marked in red. The reference levels of clustering (horizontal gray dashed lines) are the same as in [25]. These results are not sensitive to dissimilarity measures (Additional file 1: Figure S1B).
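As a rough illustration of the silhouette analysis summarized in the Fig. 2 caption, the following Python sketch computes per-sample silhouette widths for a given subtype labeling. It assumes Euclidean distance as the dissimilarity measure; the function name `silhouette_widths` and the toy coordinates are hypothetical and are not taken from the study.

```python
import math
from collections import defaultdict


def silhouette_widths(points, labels):
    """Return one silhouette width per sample (needs at least two clusters).

    Assumption: Euclidean distance; any other dissimilarity could be swapped in.
    """
    clusters = defaultdict(list)
    for idx, lab in enumerate(labels):
        clusters[lab].append(idx)

    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            # A sample alone in its cluster gets width 0 by convention.
            widths.append(0.0)
            continue
        # a: mean distance to the other members of the sample's own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths


# Hypothetical toy data: two well-separated groups in two dimensions.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
toy_labels = ["A", "A", "B", "B"]
toy_widths = silhouette_widths(toy_points, toy_labels)
print(sum(toy_widths) / len(toy_widths))
```

Widths near 1 indicate well-separated groups, and widths near 0 or below indicate that the labels do not correspond to distinct clusters under the chosen dissimilarity.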
Fig. 3 Correlated PCs from training datasets form densely connected clusters, characterizing robust major transcriptional shifts. These can then be used as the basis for continuous subtype scores. Each node represents one of the top 20 PCs in one dataset (ds). Edges indicate an absolute Pearson correlation of at least 0.5 between the corresponding loading vectors (singletons are not included in the figure). Node size is proportional to its degree (the number of PCs that it is correlated with), and edge width is proportional to Pearson correlation. Clusters were identified based on the Girvan-Newman algorithm [49], which separated four large clusters, each corresponding to a recurrent “spectrum” of subtype scores (i.e., a pattern of coordinated differential gene expression across subjects within a dataset that recurs in multiple datasets). For the first seven training datasets, the PCs present in the four clusters were all top PCs, which indicates that the strongest signals in these datasets are all true signals. For the NHS/HPFS dataset, however, PCs 1–6 were absent from the clusters. This suggests a strong batch effect (noise) in this particular dataset.
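The network construction described in the Fig. 3 caption can be sketched as follows. This is a simplified illustration, not the study's implementation: it links PCs from different datasets whose loading vectors have an absolute Pearson correlation of at least 0.5, and then groups them by connected components, which stands in for the Girvan-Newman community detection used in the paper. The function names, the restriction to cross-dataset pairs, and the toy loading vectors are assumptions.

```python
import math
from itertools import combinations


def pearson(u, v):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sd_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    sd_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (sd_u * sd_v)


def correlated_pc_groups(loadings, threshold=0.5):
    """Group PCs linked by |r| >= threshold across datasets.

    loadings maps (dataset, pc_index) to a loading vector over shared genes.
    Assumption: connected components replace Girvan-Newman clustering here.
    """
    neighbors = {node: set() for node in loadings}
    for p, q in combinations(loadings, 2):
        if p[0] == q[0]:
            continue  # only compare PCs that come from different datasets
        if abs(pearson(loadings[p], loadings[q])) >= threshold:
            neighbors[p].add(q)
            neighbors[q].add(p)

    groups, seen = [], set()
    for start in neighbors:
        if start in seen or not neighbors[start]:
            continue  # skip visited nodes and singletons
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbors[node] - group)
        seen |= group
        groups.append(group)
    return groups


# Hypothetical loading vectors over four shared genes.
toy_loadings = {
    ("ds1", 1): [1.0, 1.0, -1.0, -1.0],
    ("ds1", 2): [1.0, -1.0, 1.0, -1.0],
    ("ds2", 1): [-0.9, -1.1, 1.0, 1.0],   # sign-flipped match of ds1 PC1
    ("ds2", 2): [1.2, -0.8, 1.0, -1.0],   # match of ds1 PC2
    ("ds3", 1): [1.0, -1.0, -1.0, 1.0],   # unrelated, stays a singleton
}
for group in correlated_pc_groups(toy_loadings):
    print(sorted(group))
```

The absolute value matters because the sign of a principal component is arbitrary, so the same transcriptional axis can appear with opposite orientation in two datasets.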
Estimated overall effect sizes and p values for continuous scores on molecular, histopathological, and clinical variables from a fixed-effects model
| Variable | Continuous score | Effect size | p value |
|---|---|---|---|
| CMS1 subtype | PCSS1 | 0.82 | 3E-55 |
| | PCSS2 | −2.55 | 1E-129 |
| CMS2 subtype | PCSS1 | −2.00 | 2E-156 |
| | PCSS2 | 0.76 | 7E-60 |
| CMS3 subtype | PCSS1 | −0.57 | 2E-28 |
| | PCSS2 | −0.75 | 6E-56 |
| CMS4 subtype | PCSS1 | 1.72 | 2E-130 |
| | PCSS2 | 1.75 | 4E-106 |
| MSI | PCSS1 | 0.76 | 1E-31 |
| | PCSS2 | −1.68 | 5E-71 |
| Right location | PCSS1 | 0.087 | 0.09 |
| | PCSS2 | −0.23 | 1E-04 |
| Late stage | PCSS1 | 0.16 | 0.002 |
| | PCSS2 | 0.24 | 6E-06 |
| High grade | PCSS1 | 0.33 | 3E-05 |
| | PCSS2 | −0.30 | 7E-05 |
| Disease recurrence or death | PCSS1 | 0.23 | 5E-05 |
| | PCSS2 | 0.19 | 0.001 |
The effect size statistic is the log hazard ratio for disease recurrence or death and the log odds ratio for all other variables. These estimates are not sensitive to fixed- vs random-effects modeling (Additional file 5: Table S4). Statistics for individual datasets, including I² statistics, are also provided in Additional file 5: Table S4.
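For readers unfamiliar with fixed-effects pooling, the sketch below shows the standard inverse-variance approach for combining per-dataset log odds ratios (or log hazard ratios), together with Cochran's Q and the I² heterogeneity statistic. It is a generic illustration under stated assumptions, not the study's code: the function name `fixed_effect_pool` and the input numbers are hypothetical, and the p value uses a two-sided normal approximation.

```python
import math


def fixed_effect_pool(estimates, std_errors):
    """Inverse-variance fixed-effects pooling of per-study effect sizes."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)

    # Two-sided p value from the standard normal distribution.
    z = pooled / pooled_se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))

    # Cochran's Q and I^2 (share of variation attributed to heterogeneity).
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, estimates))
    df = len(estimates) - 1
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0

    return {
        "pooled": pooled,
        "se": pooled_se,
        "p_value": p_value,
        "Q": q,
        "I2": i_squared,
    }


# Hypothetical per-dataset log odds ratios and standard errors.
result = fixed_effect_pool([0.20, 0.40, 0.25], [0.10, 0.15, 0.12])
print(result)
```

Datasets with smaller standard errors receive larger weights, so large cohorts dominate the pooled estimate.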
Fig. 4 Continuous subtype scores consistently reproduce CMS subtypes, but provide additional information in characterizing molecular/histopathological/clinical correlates. a Subtype- and study-specific mean PCSS1 and PCSS2 scores indicate “quadrants” in the distribution of the continuous scores that correspond to CMS1–4 subtypes, with unlabeled samples clustered at the origin. Each point indicates the average PCSS1 and PCSS2 value of samples classified as a particular CMS subtype in one dataset, with error bars representing standard deviation. b CMS4 subtype has the worst DFS outcome in all samples where survival information is available, agreeing with results in [10], but stratification of CMS4 samples with respect to continuous scores reveals an even more highly at-risk subgroup at the extreme end of PCSS1/PCSS2 distributions (individual hazard ratios for each study are included in Additional file 1: Figure S9C). c Continuous scores are more closely associated with molecular and clinical/pathological variables than discrete subtypes. Molecular, histopathological, and clinical variables were regressed on subtypes and scores as covariates. Likelihood ratio tests (LRTs) were used to compare the full model, containing both subtype and score as predictors, to a simplified model containing only subtype (left) or score (right) as predictor. Test results for different datasets (p values) are represented by points in the box plots. A p value near 1 (−log10 p value near 0) suggests that no additional information is provided by the full model, whereas a small p value suggests that the full model provides additional information for predicting molecular/clinical variables.
The more significant p values for models using only discrete subtypes (left) than for models using only continuous scores (right) suggest that discrete subtypes alone lack information provided by the full model; conversely, −log10 p values near zero for scores (right) suggest that continuous scores alone capture most of the information in the full model and thus outperform discrete subtypes in characterizing the molecular and clinical/pathological variables.
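The nested-model comparison described in the Fig. 4 caption can be illustrated with a minimal likelihood ratio test. The sketch assumes that the maximized log-likelihoods of the full and simplified regression models have already been obtained from some fitting routine; the function names, the log-likelihood values, and the degrees of freedom below are hypothetical. The chi-square tail probability is computed in closed form for integer degrees of freedom using only the standard library.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)                 # exact tail for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))      # exact tail for df = 1
    # Recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1).
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def likelihood_ratio_test(loglik_reduced, loglik_full, df_diff):
    """Compare nested models; df_diff is the number of extra parameters."""
    statistic = 2.0 * (loglik_full - loglik_reduced)
    return statistic, chi2_sf(statistic, df_diff)


# Hypothetical log-likelihoods for one dataset and one clinical variable.
loglik_full = -112.5           # subtype + scores
loglik_subtype_only = -120.0   # subtype alone (drops two score terms)
loglik_score_only = -113.0     # scores alone (drops three subtype terms)

for name, reduced, df in [
    ("subtype only", loglik_subtype_only, 2),
    ("score only", loglik_score_only, 3),
]:
    stat, p = likelihood_ratio_test(reduced, loglik_full, df)
    print(name, round(stat, 2), -math.log10(p))
```

In this toy setup, a large −log10 p value for the subtype-only comparison would mean that adding scores improves the fit, whereas a value near zero for the score-only comparison would mean that adding subtype labels contributes little.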