Gregory P Way1,2, James Rudd3, Chen Wang4, Habib Hamidi5, Brooke L Fridley6, Gottfried E Konecny5, Ellen L Goode4, Casey S Greene7,8, Jennifer A Doherty9.
Abstract
Four gene expression subtypes of high-grade serous ovarian cancer (HGSC) have been previously described. In these early studies, a fraction of samples that did not fit well into the four subtype classifications were excluded. Therefore, we sought to systematically determine the concordance of transcriptomic HGSC subtypes across populations without removing any samples. We created a bioinformatics pipeline to independently cluster the five largest mRNA expression datasets using k-means and nonnegative matrix factorization (NMF). We summarized differential expression patterns to compare clusters across studies. While previous studies reported four subtypes, our cross-population comparison does not support four. Because these results contrast with previous reports, we attempted to reproduce analyses performed in those studies. Our results suggest that early results favoring four subtypes may have been driven by the inclusion of serous borderline tumors. In summary, our analysis suggests that either two or three, but not four, gene expression subtypes are most consistent across datasets.
Keywords: molecular subtypes; ovarian cancer; reproducibility; unsupervised clustering
Year: 2016 PMID: 27729437 PMCID: PMC5144978 DOI: 10.1534/g3.116.033514
Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154
Characteristics of the populations included in the five analytic datasets
| | TCGA | Mayo | Yoshihara | Tothill | Bonome |
|---|---|---|---|---|---|
| GEO | | GSE74357 | GSE32062 | GSE9891 | GSE26712 |
| Platform | Affymetrix HGU1133 | Agilent 4x44K | Agilent 4x44K | Affymetrix HGU1133 | Affymetrix HGU1133 |
| Population | United States | United States | Japan | Australia | United States |
| Original sample size | 578 | 528 | 260 | 285 | 195 |
| Analytic sample size | 499 | 379 | 256 | 242 | 185 |
| Age [Mean (SD)] | 60.0 (11.6) | 62.9 (11.3) | NR | 60.3 (10.3) | 61.5 (11.9) |
| Stage | | | | | |
| I | 10 (2%) | 7 (3%) | 0 (0%) | 11 (5%) | 0 (0%) |
| II | 17 (4%) | 11 (3%) | 0 (0%) | 8 (4%) | 0 (0%) |
| III | 351 (80%) | 275 (73%) | 202 (79%) | 178 (83%) | 146 (80%) |
| IV | 63 (14%) | 86 (23%) | 54 (21%) | 17 (8%) | 36 (20%) |
| Grade | | | | | |
| 2 | 55 (12%) | 3 (1%) | 130 (51%) | 80 (37%) | NR |
| 3 | 386 (88%) | 376 (99%) | 126 (49%) | 134 (63%) | NR |
| Debulking | | | | | |
| Optimal | 325 (74%) | 287 (76%) | 101 (39%) | 132 (62%) | 89 (49%) |
| Suboptimal | 116 (26%) | 87 (23%) | 155 (61%) | 82 (38%) | 93 (51%) |
TCGA, The Cancer Genome Atlas; NR, data not reported.
Samples without survival data were excluded in survival analyses.
One sample was labeled as “Grade 4” in TCGA.
Figure 1. Significance analysis of microarray (SAM) moderated t score Pearson correlation heatmaps reveal consistency across datasets. (A) Correlations across datasets for k-means k = 2. (B) Correlations across datasets for k-means k = 3. (C) Correlations across datasets for k-means k = 4. TCGA, The Cancer Genome Atlas.
SAM moderated t score vector Pearson correlations between analogous clusters across populations
| Datasets | k | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 |
|---|---|---|---|---|---|
| TCGA, Mayo, Yoshihara, Tothill | 2 | 0.62–0.81 | 0.62–0.81 | NR | NR |
| TCGA, Mayo, Yoshihara, Tothill | 3 | 0.77–0.85 | 0.80–0.90 | 0.65–0.77 | NR |
| TCGA, Mayo, Yoshihara, Tothill | 4 | 0.77–0.85 | 0.83–0.89 | 0.51–0.76 | 0.61–0.75 |
| Bonome | 2 | −0.08–0.24 | −0.08–0.24 | NR | NR |
| Bonome | 3 | 0.45–0.46 | −0.02–0.12 | 0.22–0.42 | NR |
| Bonome | 4 | 0.50–0.57 | −0.04–0.04 | 0.13–0.29 | 0.26–0.43 |
TCGA, The Cancer Genome Atlas; NR, data not reported.
Correlation ranges for TCGA, Mayo, Yoshihara, and Tothill.
Bonome was removed from gene set analyses because its clusters correlated poorly with those of the other datasets.
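The cross-dataset comparison summarized above rests on Pearson correlations between per-cluster score vectors. The following is a minimal sketch of that idea, not the authors' pipeline: it assumes each cluster has already been reduced to one score per gene (for example, a SAM moderated t score) with genes in the same order in both datasets. The function names `pearson` and `match_clusters`, the best-match rule, and the toy values are all hypothetical illustrations.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length score vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / sqrt(var_x * var_y)


def match_clusters(scores_a, scores_b):
    """Pair each cluster in dataset A with its best-correlated cluster in B.

    scores_a, scores_b: dict mapping cluster label -> per-gene score vector.
    Returns a dict mapping label in A -> (label in B, correlation).
    This greedy best-match rule is an assumption made for illustration.
    """
    matches = {}
    for label_a, vec_a in scores_a.items():
        correlations = {
            label_b: pearson(vec_a, vec_b) for label_b, vec_b in scores_b.items()
        }
        best = max(correlations, key=correlations.get)
        matches[label_a] = (best, correlations[best])
    return matches


# Toy, made-up scores for four genes (not values from the study).
toy_a = {"A1": [2.0, 1.0, -1.0, -2.0], "A2": [-2.0, -1.0, 1.0, 2.0]}
toy_b = {"B1": [-1.5, -1.0, 0.5, 2.0], "B2": [1.8, 1.2, -0.9, -2.1]}
toy_matches = match_clusters(toy_a, toy_b)
```

A range such as 0.62–0.81 in the table would then correspond to the minimum and maximum of such correlations over the dataset pairs being compared.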
Figure 2. Significance analysis of microarray (SAM) moderated t score Pearson correlation heatmaps of clusters formed by k-means clustering and NMF clustering reveal consistency across clustering methods. Within-dataset results are shown for both methods when setting each algorithm to find 2, 3, and 4 clusters. NMF, nonnegative matrix factorization; TCGA, The Cancer Genome Atlas.
Distributions of sample membership in the clusters identified in our study by the original cluster assignments in the TCGA, Tothill, and Konecny studies
| k | Cluster | TCGA Mes | TCGA Pro | TCGA Imm | TCGA Dif | TCGA NC | Tothill C1 | Tothill C2 | Tothill C3 | Tothill C4 | Tothill C5 | Tothill C6 | Tothill NC | Konecny C1 | Konecny C2 | Konecny C3 | Konecny C4 | Konecny NA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | Cluster 1 | 98 | 7 | 93 | 68 | 21 | 78 | 39 | 1 | 0 | 0 | 0 | 11 | 36 | 21 | 2 | 26 | 114 |
| 2 | Cluster 2 | 1 | 127 | 2 | 60 | 22 | 0 | 5 | 5 | 44 | 35 | 2 | 22 | 6 | 39 | 41 | 0 | 94 |
| 3 | Cluster 1 | 98 | 2 | 20 | 11 | 6 | 77 | 22 | 0 | 0 | 0 | 0 | 6 | 16 | 13 | 2 | 26 | 82 |
| 3 | Cluster 2 | 1 | 111 | 0 | 11 | 16 | 1 | 0 | 0 | 3 | 35 | 2 | 5 | 0 | 16 | 36 | 0 | 56 |
| 3 | Cluster 3 | 0 | 21 | 75 | 106 | 21 | 0 | 22 | 6 | 41 | 0 | 0 | 22 | 26 | 31 | 5 | 0 | 70 |
| 4 | Cluster 1 | 97 | 4 | 12 | 12 | 5 | 74 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | 12 | 3 | 25 | 62 |
| 4 | Cluster 2 | 1 | 85 | 0 | 0 | 13 | 1 | 0 | 0 | 1 | 34 | 2 | 5 | 0 | 9 | 31 | 0 | 41 |
| 4 | Cluster 3 | 0 | 5 | 80 | 3 | 12 | 3 | 42 | 0 | 1 | 1 | 0 | 14 | 29 | 6 | 0 | 1 | 57 |
| 4 | Cluster 4 | 1 | 40 | 3 | 113 | 13 | 0 | 2 | 6 | 42 | 0 | 0 | 14 | 6 | 33 | 9 | 0 | 48 |
Clusters identified in our study using k-means clustering with k = 2, k = 3, and k = 4. The corresponding labels for the generally similar HGSC gene expression subtypes observed in the TCGA, Tothill, and Konecny studies are, respectively: mesenchymal/C1/C4, proliferative/C5/C3, immunoreactive/C2/C1, and differentiated/C4/C2. TCGA, The Cancer Genome Atlas; Mes, mesenchymal; Pro, proliferative; Imm, immunoreactive; Dif, differentiated; NC, samples not clustered in original publication; NA, samples not assessed at the time of the original publication.
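A table of this kind is a cross-tabulation of two label vectors over the same samples: the newly assigned cluster and the originally published subtype. The sketch below shows one way to build such counts using only the Python standard library; the function name `cross_tabulate` and the toy labels are hypothetical and are not taken from the study's code or data.

```python
from collections import Counter


def cross_tabulate(new_labels, original_labels):
    """Count samples for every (new cluster, original subtype) pair.

    Both arguments are sequences with one label per sample, in the same
    sample order. Returns a nested dict: table[new_label][original_label].
    """
    if len(new_labels) != len(original_labels):
        raise ValueError("label vectors must describe the same samples")
    counts = Counter(zip(new_labels, original_labels))
    rows = sorted(set(new_labels))
    cols = sorted(set(original_labels))
    # Counter returns 0 for pairs that never occur, so empty cells are filled.
    return {row: {col: counts[(row, col)] for col in cols} for row in rows}


# Toy, made-up labels for five samples (not data from the study).
toy_new = ["Cluster 1", "Cluster 1", "Cluster 2", "Cluster 2", "Cluster 1"]
toy_orig = ["Mes", "Imm", "Pro", "Pro", "Mes"]
toy_table = cross_tabulate(toy_new, toy_orig)
```

Samples that the original publication left unclustered or did not assess can be carried through as their own label (for example "NC" or "NA") so that no sample is dropped from the counts.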
Figure 3. Comparing NMF consensus clustering in the Tothill dataset. Consensus clustering is shown for k = 2 to k = 6 across 10 NMF initializations, alongside the cophenetic correlation results for k = 2 to k = 8. (A) Tothill dataset (n = 260) with borderline samples (n = 18) not removed prior to clustering. (B) Tothill dataset with borderline samples removed (n = 242).
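Consensus clustering of the kind shown in Figure 3 summarizes repeated clustering runs as a matrix whose entry (i, j) is the fraction of runs in which samples i and j fall in the same cluster. The sketch below computes only that matrix from precomputed label vectors; it is an illustration under stated assumptions, not the authors' implementation. The function name `consensus_matrix` and the toy runs are hypothetical, and the NMF step and the cophenetic correlation (usually derived from a hierarchical clustering of one minus the consensus matrix) are left out.

```python
def consensus_matrix(runs):
    """Fraction of runs in which each pair of samples shares a cluster.

    runs: list of label lists, one per clustering run, all of length n.
    Labels only need to be consistent within a run, so label switching
    between runs does not affect the result.
    """
    if not runs:
        raise ValueError("at least one clustering run is required")
    n = len(runs[0])
    if any(len(labels) != n for labels in runs):
        raise ValueError("every run must label the same number of samples")
    counts = [[0] * n for _ in range(n)]
    for labels in runs:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[value / len(runs) for value in row] for row in counts]


# Toy, made-up label vectors for four samples over three runs.
toy_runs = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 0, 1]]
toy_consensus = consensus_matrix(toy_runs)
```

Entries near 0 or 1 indicate pairs that are assigned stably across initializations, while intermediate values indicate unstable assignments for that choice of k.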