Gad Abraham, Adam Kowalczyk, Sherene Loi, Izhak Haviv, Justin Zobel.
Abstract
BACKGROUND: Different microarray studies have compiled gene lists for predicting outcomes of a range of treatments and diseases. These have produced gene lists that have little overlap, indicating that the results from any one study are unstable. It has been suggested that the underlying pathways are essentially identical, and that the expression of gene sets, rather than that of individual genes, may be more informative with respect to prognosis and understanding of the underlying biological process.
Year: 2010 PMID: 20500821 PMCID: PMC2895626 DOI: 10.1186/1471-2105-11-277
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Sample sizes and breakdown by class
| Dataset | Good Obs. (< 5 years) | Good Obs. (≥ 5 years) | Good Obs. (Total) | Removed Obs. |
|---|---|---|---|---|
| GSE2034 | 82 | 165 | 247 | 8 |
| GSE4922 | 30 | 103 | 133 | 9 |
| GSE6532 | 21 | 91 | 112 | 25 |
| GSE7390 | 36 | 154 | 190 | 8 |
| GSE11121 | 28 | 154 | 182 | 18 |
Observations (samples) were removed if they were censored before the 5-year cutoff.
Figure 1. Classification. Average and 95% confidence intervals for AUC from external validation between the five datasets ($n = 2 \times \binom{5}{2} = 20$ (train, test) pairs) for different numbers of features. Note that each dataset ranks its features independently; hence, the kth feature is not necessarily the same across datasets. "raw" denotes individual genes.
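The pair count and the AUC in this caption can be illustrated with a minimal sketch. It assumes the AUC is computed as the usual rank statistic (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half); the function name `auc` and the toy inputs are hypothetical and not taken from the paper.

```python
from itertools import permutations

DATASETS = ["GSE2034", "GSE4922", "GSE6532", "GSE7390", "GSE11121"]

def auc(scores, labels):
    """Rank-based AUC: P(score of a positive > score of a negative), ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Ordered (train, test) pairs: every dataset is tested on every other one,
# which gives 2 * C(5, 2) = 20 pairs.
pairs = list(permutations(DATASETS, 2))
```

Each of the 20 pairs would yield one AUC value, and the figure summarises their mean and confidence interval for each number of features.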
Figure 6. Kolmogorov-Smirnov analysis. Kolmogorov-Smirnov enrichment for MSigDB categories, using the set-centroid statistic. A: AUC and spline smooth for each set, tested on GSE11121. B: Number of mapped probesets in each set, on a log2 scale, with spline smooth. C: Two-sample Kolmogorov-Smirnov Brownian bridge for each MSigDB category (p-values: C1: 1.44 × 10⁻⁴, C2: 3.55 × 10⁻¹⁵, C3: < 2.22 × 10⁻¹⁶, C4: 4.22 × 10⁻¹³, C5: 2.38 × 10⁻²).
Top gene sets by average rank
| # | Set | Cat. | Sign | MSigDB description | Enriched GO BP Terms (adj. p-value) |
|---|---|---|---|---|---|
| 1 | GNF2_MKI67 | C4 | -1 | Neighborhood of MKI67 | "phosphoinositide-mediated signaling": 1.95 × 10⁻¹⁰, "spindle organization": 5.86 × 10⁻⁶, "establishment of mitotic spindle localization": 1.10 × 10⁻⁵, "kinetochore assembly": 5.48 × 10⁻⁵, "mitotic chromosome condensation": 1.37 × 10⁻⁴, "protein complex localization": 2.55 × 10⁻³, "regulation of striated muscle development": 2.55 × 10⁻³, "metaphase plate congression": 2.55 × 10⁻³ |
| 2 | GNF2_CCNA2 | C4 | -1 | Neighborhood of CCNA2 | "phosphoinositide-mediated signaling": 4.05 × 10⁻¹⁶, "DNA replication": 1.04 × 10⁻⁹, "mitotic chromosome condensation": 1.32 × 10⁻⁸, "regulation of striated muscle development": 3.76 × 10⁻³, "metaphase plate congression": 3.76 × 10⁻³ |
| 3 | GNF2_TTK | C4 | -1 | Neighborhood of TTK | "phosphoinositide-mediated signaling": < 2.22 × 10⁻¹⁶, "mitotic chromosome condensation": 4.35 × 10⁻¹⁴, "DNA replication": 1.01 × 10⁻¹², "spindle organization": 1.37 × 10⁻⁹, "establishment of mitotic spindle localization": 9.59 × 10⁻⁵, "kinetochore assembly": 4.76 × 10⁻⁴, "DNA repair": 5.78 × 10⁻³, "mitosis": 9.44 × 10⁻³ |
| 4 | GNF2_HMMR | C4 | -1 | Neighborhood of HMMR | "phosphoinositide-mediated signaling": < 2.22 × 10⁻¹⁶, "mitotic cell cycle spindle assembly checkpoint": 1.26 × 10⁻¹¹, "spindle organization": 4.89 × 10⁻¹⁰, "mitotic chromosome condensation": 8.46 × 10⁻⁸, "cell proliferation": 6.22 × 10⁻⁶, "DNA replication": 1.09 × 10⁻⁵, "establishment of mitotic spindle localization": 5.33 × 10⁻⁵, "kinetochore assembly": 2.65 × 10⁻⁴, "protein complex localization": 8.29 × 10⁻³, "regulation of striated muscle development": 8.29 × 10⁻³, "metaphase plate congression": 8.29 × 10⁻³ |
| 5 | GNF2_CDC20 | C4 | -1 | Neighborhood of CDC20 | "phosphoinositide-mediated signaling": < 2.22 × 10⁻¹⁶, "spindle organization": 2.20 × 10⁻¹², "mitotic cell cycle spindle assembly checkpoint": 4.07 × 10⁻¹¹, "mitotic chromosome condensation": 1.52 × 10⁻⁹, "cell proliferation": 8.96 × 10⁻⁹, "mitosis": 1.83 × 10⁻⁸, "establishment of mitotic spindle localization": 8.95 × 10⁻⁵, "kinetochore assembly": 4.45 × 10⁻⁴, "DNA replication": 7.83 × 10⁻³ |
| 6 | GNF2_SMC2L1 | C4 | -1 | Neighborhood of SMC2L1 | "mitotic cell cycle spindle assembly checkpoint": 5.15 × 10⁻¹³, "mitotic chromosome condensation": 7.16 × 10⁻⁹, "phosphoinositide-mediated signaling": 2.14 × 10⁻⁶, "establishment of mitotic spindle localization": 1.31 × 10⁻⁵, "kinetochore assembly": 6.51 × 10⁻⁵, "protein complex localization": 2.90 × 10⁻³, "DNA strand elongation during DNA replication": 2.90 × 10⁻³, "regulation of striated muscle development": 2.90 × 10⁻³, "metaphase plate congression": 2.90 × 10⁻³, "cell proliferation": 2.94 × 10⁻³, "nucleotide-excision repair, DNA gap filling": 3.56 × 10⁻³ |
| 7 | GNF2_H2AFX | C4 | -1 | Neighborhood of H2AFX | "cell proliferation": 9.28 × 10⁻¹⁰, "phosphoinositide-mediated signaling": 5.54 × 10⁻⁷, "mitosis": 8.48 × 10⁻⁵, "mitotic cell cycle spindle assembly checkpoint": 1.33 × 10⁻⁴, "protein complex localization": 1.63 × 10⁻³ |
| 8 | GNF2_ESPL1 | C4 | -1 | Neighborhood of ESPL1 | "phosphoinositide-mediated signaling": 5.38 × 10⁻¹¹, "kinetochore assembly": 3.12 × 10⁻⁵, "mitotic chromosome condensation": 6.75 × 10⁻⁵, "spindle organization": 7.76 × 10⁻⁴, "protein complex localization": 1.67 × 10⁻³, "regulation of striated muscle development": 1.67 × 10⁻³, "metaphase plate congression": 1.67 × 10⁻³ |
| 9 | GNF2_RRM2 | C4 | -1 | Neighborhood of RRM2 | "phosphoinositide-mediated signaling": 4.52 × 10⁻¹⁵, "mitotic cell cycle spindle assembly checkpoint": 1.17 × 10⁻⁹, "spindle organization": 1.20 × 10⁻⁷, "DNA replication": 5.42 × 10⁻⁶, "cell proliferation": 1.97 × 10⁻⁵, "establishment of mitotic spindle localization": 4.09 × 10⁻⁵, "kinetochore assembly": 2.03 × 10⁻⁴, "protein complex localization": 6.80 × 10⁻³, "regulation of striated muscle development": 6.80 × 10⁻³, "metaphase plate congression": 6.80 × 10⁻³ |
| 10 | GNF2_PCNA | C4 | -1 | Neighborhood of PCNA | "phosphoinositide-mediated signaling": < 2.22 × 10⁻¹⁶, "DNA replication": 1.47 × 10⁻¹⁵, "mitotic chromosome condensation": 2.36 × 10⁻⁷, "spindle organization": 4.33 × 10⁻⁷, "establishment of mitotic spindle localization": 9.59 × 10⁻⁵, "cell proliferation": 4.18 × 10⁻⁴, "DNA repair": 4.33 × 10⁻⁴, "kinetochore assembly": 4.76 × 10⁻⁴, "mitosis": 9.44 × 10⁻³ |
Top 10 gene sets by average rank over the five datasets, using the set centroid statistic. GO enrichment p-values are from a Bonferroni-adjusted one-sided Fisher's exact test (30,330 tests). Sign = -1 if expression is negatively associated with long-term survival, and vice versa. The background list for the test includes all Affymetrix HG-U133A probesets that could be mapped to GO BP terms, excluding IEA annotations.
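The enrichment test named in this caption can be sketched as follows. This is a minimal illustration that assumes the one-sided Fisher's exact test is evaluated as an upper hypergeometric tail and that the Bonferroni adjustment multiplies each p-value by the number of tests, capped at 1. The function names and argument names are hypothetical, and the sketch is not the authors' implementation.

```python
from math import comb

def fisher_one_sided(k, set_size, n_selected, n_background):
    """Upper-tail hypergeometric probability P(X >= k).

    k            : observed overlap between the selected list and the annotated set
    set_size     : number of background items annotated with the term
    n_selected   : size of the selected list
    n_background : size of the background list
    """
    upper = min(set_size, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(set_size, i) * comb(n_background - set_size, n_selected - i)
        for i in range(k, upper + 1)
    )
    return tail / total

def bonferroni(p, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p * n_tests)
```

For example, `bonferroni(fisher_one_sided(k, K, n, N), 30330)` would give an adjusted p-value for one GO term under the 30,330 tests mentioned in the caption, where `k`, `K`, `n`, and `N` are the counts for that term.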
Figure 2. Classification variance. Variance and 95% confidence intervals of the AUC from external validation between the five datasets ($n = 2 \times \binom{5}{2} = 20$ (train, test) pairs) for different numbers of features. The confidence intervals are $\left[ \frac{(n-1)s^2}{\chi^2_{1-\alpha/2,\,n-1}},\ \frac{(n-1)s^2}{\chi^2_{\alpha/2,\,n-1}} \right]$, where $\chi^2_{p,\,n-1}$ is the $p$ quantile of a chi-squared distribution with $n - 1$ degrees of freedom, $\alpha = 0.05$, and $s^2$ is the sample variance.
Figure 3. Bootstrap. Mean and 2.5%/97.5% percentiles of the ranks of genes and gene sets (set centroid statistic), over 5000 bootstrap replications of the GSE4922 dataset. The features have been sorted by their mean rank.
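A bootstrap of feature ranks along these lines can be sketched as below. The scoring rule (absolute difference between class means) is an assumption chosen only to make the example runnable; it stands in for whatever statistic ranks the features and is not claimed to be the paper's. The helper names and the nearest-rank percentile are likewise hypothetical.

```python
import random
from statistics import mean

def score_features(X, y):
    """Hypothetical score: absolute difference between the two class means, per feature."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, lab in zip(X, y) if lab == 1]
        neg = [row[j] for row, lab in zip(X, y) if lab == 0]
        scores.append(abs(mean(pos) - mean(neg)))
    return scores

def ranks_from_scores(scores):
    """Rank 1 is the highest-scoring feature."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0] * len(scores)
    for r, j in enumerate(order, start=1):
        ranks[j] = r
    return ranks

def percentile(sorted_vals, q):
    """Nearest-rank percentile of an already sorted list."""
    idx = round(q * (len(sorted_vals) - 1))
    return sorted_vals[idx]

def bootstrap_ranks(X, y, n_boot=5000, seed=0):
    """Return (mean rank, 2.5% rank, 97.5% rank) per feature.

    Assumes y contains both classes; resamples missing a class are redrawn.
    """
    rng = random.Random(seed)
    n = len(X)
    per_feature = [[] for _ in X[0]]
    done = 0
    while done < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        yb = [y[i] for i in idx]
        if len(set(yb)) < 2:
            continue
        Xb = [X[i] for i in idx]
        for j, r in enumerate(ranks_from_scores(score_features(Xb, yb))):
            per_feature[j].append(r)
        done += 1
    summary = []
    for rs in per_feature:
        rs.sort()
        summary.append((mean(rs), percentile(rs, 0.025), percentile(rs, 0.975)))
    return summary
```

Sorting the returned summaries by mean rank gives the ordering used on the figure's horizontal axis; a narrow percentile band indicates a feature whose rank is stable under resampling.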
Figure 4. External validation. Spearman rank-correlation of the centroid classifier's weights from the five datasets ($n = \binom{5}{2} = 10$ comparisons). "raw" denotes individual genes.
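The ten pairwise comparisons can be sketched as follows, assuming each dataset yields one weight per feature in a shared feature order. Spearman correlation is computed here as the Pearson correlation of average ranks; the function names and the input layout (a dictionary from dataset name to weight list) are hypothetical.

```python
from itertools import combinations
from math import sqrt

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman rank correlation (Pearson correlation of the rank vectors)."""
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / sqrt(va * vb)

def pairwise_spearman(weights):
    """weights: dict mapping dataset name -> list of weights in a shared feature order."""
    return {
        (p, q): spearman(weights[p], weights[q])
        for p, q in combinations(sorted(weights), 2)
    }
```

With five datasets, `pairwise_spearman` returns one correlation for each of the 10 unordered dataset pairs.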
Figure 5. List concordance. Concordance of feature lists (genes or gene sets) for different cutoffs f = 1,...,200, counting the number of features that occur in all five datasets' lists at a rank higher than f. "raw" denotes individual genes.
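The concordance count can be sketched as a set intersection over the top of each ranked list. This minimal example reads "ranked higher than f" as "within the top f positions", which is an interpretation of the caption; the function name and the toy lists are hypothetical.

```python
def concordance(ranked_lists, max_f):
    """Return [c_1, ..., c_max_f], where c_f is the number of features
    that appear in the top f positions of every ranked list."""
    counts = []
    for f in range(1, max_f + 1):
        common = set.intersection(*(set(lst[:f]) for lst in ranked_lists))
        counts.append(len(common))
    return counts

# Toy example with three short lists (best feature first).
example = concordance(
    [["a", "b", "c"], ["b", "a", "c"], ["a", "c", "b"]],
    max_f=3,
)
```

In the toy example no feature is first in all three lists, one feature (`"a"`) is in every top 2, and all three features are in every top 3.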
Figure 7. ER/HER2 subtypes. Expression of ESR1 (ER) versus ERBB2 (HER2) for the combined dataset. A mixture of three Gaussians is fitted to the data. Clusters 1, 2, and 3 represent the ER-/HER2-, ER+/HER2-, and HER2+ subtypes, respectively.
Breakdown of samples for each cancer subtype
| Class | Subtype | < 5 years | ≥ 5 years | Total |
|---|---|---|---|---|
| 1 | ER-/HER2- | 35 | 80 | 115 |
| 2 | ER+/HER2- | 107 | 423 | 530 |
| 3 | HER2+ | 55 | 164 | 219 |
Top gene sets for each ER/HER2 subtype
| Class | # | MSigDB Set | Cat. | Description | Sign |
|---|---|---|---|---|---|
| ER-/HER2- | 1 | chr7q12 | C1 | Genes in cytogenetic band chr7q12 | 1 |
| | 2 | COLLER_MYC_DN | C2 | Genes down-regulated by MYC in 293T (transformed fetal renal cell). | -1 |
| | 3 | IFNGPATHWAY | C2 | IFN gamma signaling pathway | 1 |
| | 4 | GRANDVAUX_IFN_NOT_IRF3_UP | C2 | Genes up-regulated by interferon-alpha, beta but not by IRF3 in Jurkat (T cell) | 1 |
| | 5 | GNF2_ST13 | C4 | Neighborhood of ST13 | -1 |
| | 6 | GNF2_CD48 | C4 | Neighborhood of CD48 | 1 |
| | 7 | GNF2_GLTSCR2 | C4 | Neighborhood of GLTSCR2 | -1 |
| | 8 | MENSE_HYPOXIA_DN | C2 | List of hypoxia-suppressed genes found in both astrocytes and HeLa cells | -1 |
| | 9 | HSA03010_RIBOSOME | C2 | Genes involved in ribosome | -1 |
| | 10 | GCM_TPT1 | C4 | Neighborhood of TPT1 | -1 |
| ER+/HER2- | 1 | GNF2_MKI67 | C4 | Neighborhood of MKI67 | -1 |
| | 2 | GNF2_TTK | C4 | Neighborhood of TTK | -1 |
| | 3 | GNF2_HMMR | C4 | Neighborhood of HMMR | -1 |
| | 4 | GNF2_CCNA2 | C4 | Neighborhood of CCNA2 | -1 |
| | 5 | GNF2_SMC2L1 | C4 | Neighborhood of SMC2L1 | -1 |
| | 6 | GNF2_ESPL1 | C4 | Neighborhood of ESPL1 | -1 |
| | 7 | GNF2_CDC20 | C4 | Neighborhood of CDC20 | -1 |
| | 8 | GNF2_H2AFX | C4 | Neighborhood of H2AFX | -1 |
| | 9 | GNF2_RRM2 | C4 | Neighborhood of RRM2 | -1 |
| | 10 | ZHAN_MM_CD138_PR_VS_REST | C2 | 50 top ranked SAM-defined over-expressed genes in each subgroup_PR | -1 |
| HER2+ | 1 | chr4p | C1 | Genes in cytogenetic band chr4p | -1 |
| | 2 | chr1q11 | C1 | Genes in cytogenetic band chr1q11 | 1 |
| | 3 | DAC_FIBRO_DN | C2 | Downregulated by DAC treatment in LD419 fibroblast cells | -1 |
| | 4 | GNF2_MKI67 | C4 | Neighborhood of MKI67 | -1 |
| | 5 | GNF2_CCNA2 | C4 | Neighborhood of CCNA2 | -1 |
| | 6 | GNF2_TTK | C4 | Neighborhood of TTK | -1 |
| | 7 | GNF2_H2AFX | C4 | Neighborhood of H2AFX | -1 |
| | 8 | GNF2_HMMR | C4 | Neighborhood of HMMR | -1 |
| | 9 | CROONQUIST_L6_RAS_DN | C2 | Genes downregulated in multiple myeloma cells exposed to the pro-proliferative cytokine IL-6 versus those with N-ras-activating mutations. | -1 |
| | 10 | CROONQUIST_L6_STARVE_UP | C2 | Genes upregulated in multiple myeloma cells exposed to the pro-proliferative cytokine IL-6 versus those that were IL-6-starved. | -1 |
Top 10 MSigDB sets for ER/HER2 molecular subtypes, chosen by the centroid classifier using the set centroid statistic. Sign = -1 if expression is negatively associated with long-term survival, and vice versa.
Overlap between top genes and gene sets for different classifiers
| Classifier | # | MSigDB set | p-value | matches | set size |
|---|---|---|---|---|---|
| CC | 1 | GNF2_MKI67 | < 1.00 × 10⁻⁴⁰ | 31 | 47 |
| | 2 | GNF2_TTK | < 1.00 × 10⁻⁴⁰ | 29 | 57 |
| | 3 | GNF2_CCNA2 | < 1.00 × 10⁻⁴⁰ | 48 | 99 |
| | 4 | GNF2_HMMR | < 1.00 × 10⁻⁴⁰ | 42 | 78 |
| | 5 | GNF2_SMC2L1 | < 1.00 × 10⁻⁴⁰ | 26 | 51 |
| | 6 | GNF2_CDC20 | < 1.00 × 10⁻⁴⁰ | 46 | 91 |
| | 7 | GNF2_ESPL1 | < 1.00 × 10⁻⁴⁰ | 27 | 58 |
| | 8 | GNF2_H2AFX | < 1.00 × 10⁻⁴⁰ | 24 | 54 |
| | 9 | GNF2_RRM2 | < 1.00 × 10⁻⁴⁰ | 32 | 68 |
| | 10 | chr1q11 | 2.32 × 10⁻⁶ | 2 | 4 |
| SVM | 1 | chr7q12 | 6.23 × 10⁻⁴ | 1 | 1 |
| | 2 | chr3q11 | 1.00 | 0 | 8 |
| | 3 | chrxq | 1.00 | 0 | 2 |
| | 4 | BYSTRYKH_RUNX1_TARGETS_GLOCUS | 8.06 × 10⁻³ | 1 | 13 |
| | 5 | TESTIS_EXPRESSED_GENES | 7.28 × 10⁻⁷ | 4 | 107 |
| | 6 | chr22q | 1.00 | 0 | 6 |
| | 7 | REGULATION_OF_G_PROTEIN_COUPLED_RECEPTOR_PROTEIN_SIGNALING_PATHWAY | 4.28 × 10⁻⁴ | 2 | 48 |
| | 8 | chr11p14 | 1.00 | 0 | 20 |
| | 9 | TERCPATHWAY | 1.00 | 0 | 15 |
| | 10 | chr1q41 | 2.02 × 10⁻⁴ | 2 | 33 |
| LR | 1 | chrSqll | 1.00 | 0 | 8 |
| | 2 | chr22q | 1.00 | 0 | 6 |
| | 3 | TERCPATHWAY | 1.00 | 0 | 15 |
| | 4 | chrxq | 1.00 | 0 | 2 |
| | 5 | BYSTRYKH_RUNX1_TARGETS_GLOCUS | 8.06 × 10⁻³ | 1 | 13 |
| | 6 | HSA00130_UBIQUINONE_BIOSYNTHESIS | 1.00 | 0 | 8 |
| | 7 | chr20p | 1.00 | 0 | 2 |
| | 8 | chr1q41 | 1.29 × 10⁻⁶ | 3 | 33 |
| | 9 | chr3q12 | 1.00 | 0 | 23 |
| | 10 | BETA_TUBULIN_BINDING | 1.00 | 0 | 12 |
Top 10 sets by the set centroid statistic for different classifiers, with the p-value for the number of top genes belonging to each set (one-sided Fisher's exact test). CC is the centroid classifier, SVM is the support vector machine, and LR is logistic regression.