David J Torres, Judy L Cannon, Ulises M Ricoy, Christopher Johnson.
Abstract
Microarrays are a powerful tool for studying differential gene expression. However, lists of many differentially expressed genes are often generated, and unraveling meaningful biological processes from the lists can be challenging. For this reason, investigators have sought to quantify the statistical probability of compiled gene sets rather than individual genes. The gene sets typically are organized around a biological theme or pathway. We compute correlations between different gene set tests and elect to use Fisher's self-contained method for gene set analysis. We improve Fisher's differential expression analysis of a gene set by limiting the p-value of an individual gene within the gene set to prevent a small percentage of genes from determining the statistical significance of the entire set. In addition, we compute dependencies among genes within the set to determine which genes are statistically linked. The method is applied to T-ALL (T-lineage Acute Lymphoblastic Leukemia) to identify differentially expressed gene sets between T-ALL and normal patients and between T-ALL and AML (Acute Myeloid Leukemia) patients.
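As a rough illustration of the approach the abstract describes, the sketch below combines per-gene p-values with Fisher's statistic, −2 Σ ln(p_i), which follows a chi-squared distribution with 2K degrees of freedom for K independent p-values under the null hypothesis. The optional `p_floor` argument clips each p-value from below, which is one plausible reading of "limiting the p-value of an individual gene". The function names, the parameter name, and the floor value of 1e-3 are illustrative assumptions, not the authors' code. The chi-squared tail is exact only for the unmodified statistic; applied to the clipped statistic it yields a conservative p-value, because clipping can only lower the statistic.

```python
import math


def chi2_sf_even(x, dof):
    """Upper tail P(X >= x) of a chi-squared variable with an even dof.

    For dof = 2K the tail has the closed form
    exp(-x/2) * sum_{i=0}^{K-1} (x/2)^i / i!.
    """
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("dof must be a positive even integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, dof // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def fisher_statistic(pvalues, p_floor=None):
    """Fisher's statistic; p_floor (hypothetical name) clips small p-values."""
    if p_floor is not None:
        pvalues = [max(p, p_floor) for p in pvalues]
    return -2.0 * sum(math.log(p) for p in pvalues)


def fisher_combined_p(pvalues, p_floor=None):
    """Gene set p-value from the chi-squared tail with 2K degrees of freedom."""
    stat = fisher_statistic(pvalues, p_floor)
    return chi2_sf_even(stat, 2 * len(pvalues))


# One extreme gene among ten, with the other nine unremarkable (made-up values).
example = [1e-12] + [0.5] * 9
p_plain = fisher_combined_p(example)
p_clipped = fisher_combined_p(example, p_floor=1e-3)  # floor is an assumption
print(p_plain, p_clipped)
```

With these made-up inputs the single extreme gene is enough to push the unmodified set below α = .01, while the clipped version is not, which is the behavior the modification is meant to prevent.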
Year: 2016 PMID: 27711232 PMCID: PMC5053608 DOI: 10.1371/journal.pone.0163918
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Correlation of Fisher’s method with other self-contained methods for gene set analysis.
| | Fisher | SAM-GS | Stouffer | Hotelling | TS | K-S |
|---|---|---|---|---|---|---|
| Fisher | 1.0 | .99 | .98 | .88 | .87 | .77 |
| SAM-GS | .99 | 1.0 | .94 | .89 | .78 | .70 |
| Stouffer | .98 | .94 | 1.0 | .83 | .95 | .85 |
| Hotelling | .88 | .89 | .83 | 1.0 | .70 | .62 |
| TS | .87 | .78 | .95 | .70 | 1.0 | .90 |
| K-S | .77 | .70 | .85 | .62 | .90 | 1.0 |
Computed p-values of highest ranked genes required to make the entire set of K genes significant at α = .01 using Fisher’s method.
| K | p (1 gene) | p (2 genes) | p (3 genes) | p (4 genes) | p (5 genes) |
|---|---|---|---|---|---|
| 10 | 3.5 × 10^-6 | 1.3 × 10^-3 | 9.6 × 10^-3 | 2.6 × 10^-2 | 4.7 × 10^-2 |
| 20 | 7.7 × 10^-9 | 6.2 × 10^-5 | 1.2 × 10^-3 | 5.6 × 10^-3 | 1.4 × 10^-2 |
| 40 | 2.3 × 10^-13 | 3.4 × 10^-7 | 3.8 × 10^-5 | 4.1 × 10^-4 | 1.7 × 10^-3 |
| 80 | 2.4 × 10^-21 | 3.4 × 10^-11 | 8.4 × 10^-8 | 4.1 × 10^-6 | 4.3 × 10^-5 |
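The entries in this table can be approximated with a short calculation. The sketch below assumes that the m top-ranked genes share one common p-value while the remaining K − m genes all sit at p = .5; that background value is an assumption inferred from the numbers, not something stated in this excerpt. The helper names (`chi2_critical`, `required_top_p`) are hypothetical.

```python
import math


def chi2_sf_even(x, dof):
    """Upper tail of a chi-squared variable with an even number of dof."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, dof // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def chi2_critical(alpha, dof):
    """Critical value c with P(X >= c) = alpha, found by bisection."""
    lo, hi = 0.0, 10.0 * dof + 100.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_sf_even(mid, dof) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def required_top_p(n_genes, n_top, alpha=0.01, background_p=0.5):
    """Common p-value the n_top genes need for the whole set to reach alpha.

    background_p is the ASSUMED p-value of the other n_genes - n_top genes.
    """
    crit = chi2_critical(alpha, 2 * n_genes)
    # Part of the statistic that the top genes must supply on their own.
    needed = crit + 2.0 * (n_genes - n_top) * math.log(background_p)
    return min(1.0, math.exp(-needed / (2.0 * n_top)))


for k in (10, 20, 40, 80):
    print(k, [required_top_p(k, m) for m in range(1, 6)])
```

Under that assumption the computed thresholds land within a few percent of the tabulated values, and they make the table's point concrete: a single gene with a sufficiently small p-value can carry an otherwise unremarkable set.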
Fig 1. Chi-squared distribution corresponding to K = 10 (20 degrees of freedom).
Fig 2. Difference between the probability density functions of the modified Fisher’s method (Mod FM) and Fisher’s method (FM) at different minimum p-values.
Minimum number of genes required to achieve a global gene set significance of α = .01 using the modified Fisher’s method at different levels of p.
| Number of genes (K) | ||||
|---|---|---|---|---|
| 10 | 5 | 3 | 2 | 1 |
| 20 | 7 | 5 | 3 | 1 |
| 40 | 11 | 7 | 5 | 1 |
| 80 | 17 | 12 | 8 | 1 |
Fig 3. Proportion of genes required to achieve a global gene set significance of α = .01 using the modified Fisher’s method at different levels of p and different gene set sizes.
Power of different self-contained gene set methods for different numbers of patients, K = 20 genes, and 10,000 gene sets.
Expression levels are created by sampling from a normal distribution with standard deviation σ = 1 and mean μ = 0 for the first set and mean μ = .15 for the second set.
| Number of Patients | Fisher | Fisher | Fisher | SAM-GS | Stouffer | TS | K-S | |
|---|---|---|---|---|---|---|---|---|
| 50 | .45 | .45 | .44 | .44 | .44 | .37 | .36 | .32 |
| 100 | .84 | .84 | .83 | .83 | .83 | .79 | .74 | .68 |
| 200 | .997 | .996 | .996 | .996 | .996 | .994 | .986 | .977 |
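The simulation behind a table like this can be sketched in a few lines. The version below makes several assumptions that this excerpt does not confirm: "number of patients" is taken as the size of each group, the significance level is α = .05, each gene is tested with a two-sided z-test using the known σ = 1 (standing in for whichever per-gene test the authors used), and only 200 gene sets are drawn to keep the run short. All function names are illustrative.

```python
import math
import random
import statistics


def fisher_combined_p(pvalues):
    """Fisher's method: chi-squared tail with 2K dof, closed form for even dof."""
    half = -sum(math.log(max(p, 1e-300)) for p in pvalues)  # statistic / 2
    term = 1.0
    total = 1.0
    for i in range(1, len(pvalues)):
        term *= half / i
        total += term
    return math.exp(-half) * total


def z_test_p(group_a, group_b, sigma=1.0):
    """Two-sided z-test for a difference in means with known sigma (assumption)."""
    se = sigma * math.sqrt(1.0 / len(group_a) + 1.0 / len(group_b))
    z = (statistics.fmean(group_b) - statistics.fmean(group_a)) / se
    return math.erfc(abs(z) / math.sqrt(2.0))


def rejection_rate(n_patients, shift, n_genes=20, n_sets=200, alpha=0.05, seed=0):
    """Fraction of simulated gene sets declared significant.

    shift > 0 estimates power; shift = 0 estimates the Type I error rate.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sets):
        pvalues = []
        for _ in range(n_genes):
            group_a = [rng.gauss(0.0, 1.0) for _ in range(n_patients)]
            group_b = [rng.gauss(shift, 1.0) for _ in range(n_patients)]
            pvalues.append(z_test_p(group_a, group_b))
        if fisher_combined_p(pvalues) < alpha:
            rejections += 1
    return rejections / n_sets


if __name__ == "__main__":
    print("power:", rejection_rate(100, 0.15))
    print("type I:", rejection_rate(100, 0.0))
```

With only 200 gene sets the estimates are noisy, so they should be read as a check on the general range rather than a reproduction of the tabulated figures.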
Fraction of Type I errors for gene set methods, K = 20 genes, n = 100 patients, 10,000 gene sets.
Expression levels for each gene set are created by sampling from a standard normal distribution (μ = 0, σ = 1).
| Fisher | Fisher | SAM-GS | Stouffer | TS | K-S | ||
|---|---|---|---|---|---|---|---|
| Fraction of Type I errors | .049 | .05, (.049) | .051 | .049 | .051 | .049 | .048 |
Fig 4. ln(p) of the correlated p-value vs ln(p) of the uncorrelated p-value at different correlation levels (r).
Relationship between uncorrelated p-values and correlated p-values at correlation level r = .05.
| Uncorrelated p-value | 10^-2 | 10^-4 | 10^-6 | 10^-8 | 10^-10 | 10^-12 |
|---|---|---|---|---|---|---|
| Correlated p-value | .12 | .034 | .01 | .003 | .00087 | .00025 |
| Ratio (correlated / uncorrelated) | 12 | 340 | 1.0 × 10^4 | 3.0 × 10^5 | 8.7 × 10^6 | 2.5 × 10^8 |
Gene sets and associated p-values that are differentially expressed (T-ALL versus Healthy) using Gene Expression Omnibus Accession GSE46170 and a False Discovery Rate of .0025.
Individual genes within each set can be found at software.broadinstitute.org/gsea/msigdb [2]. Individual gene p-values are computed with the Wilcoxon rank-sum test. Gene sets identified with an asterisk (*) were also identified by Stouffer’s method. Descriptions of the gene sets in Table 7 are taken from Subramanian et al. [2]:

1. “Deregulation of CDK5 in Alzheimers Disease”
2. “Genes involved in Pre-NOTCH Transcription and Translation”
3. “Genes involved in Regulation of Complement cascade”
4. “Genes involved in p38MAPK events”
5. “Oxidative Stress Induced Gene Expression Via Nrf2”
6. “Genes involved in Signaling by BMP”
7. “Genes involved in Elevation of cytosolic Ca2+ levels”
8. “Genes up-regulated during formation of blood vessels (angiogenesis)”
9. “Genes involved in Synthesis, Secretion, and Inactivation of Glucose-dependent Insulinotropic Polypeptide (GIP)”

| | GENE SET | DATABASE | Number of genes | p-value |
|---|---|---|---|---|
| 1 | (*)BIOCARTA_ | Biocarta | 11 (11) | 1 × 10^-6 |
| 2 | (*)REACTOME_PRE_NOTCH_ | Reactome | 29 (27) | 1 × 10^-6 |
| 3 | (*)REACTOME_REGULATION_ | Reactome | 14 (13) | 1 × 10^-6 |
| 4 | (*)REACTOME_P38MAPK_EVENTS | Reactome | 13 (13) | 1 × 10^-6 |
| 5 | BIOCARTA_ARENRF2_PATHWAY | Biocarta | 13 (13) | 2 × 10^-6 |
| 6 | REACTOME_SIGNALING_BY_BMP | Reactome | 23 (22) | 2 × 10^-6 |
| 7 | (*)REACTOME_ELEVATION_ | Reactome | 10 (8) | 2 × 10^-6 |
| 8 | HALLMARK_ANGIOGENESIS | Hallmark | 36 (36) | 2 × 10^-6 |
| 9 | (*)REACTOME_SYNTHESIS_ | Reactome | 14 (12) | 2.1 × 10^-5 |
Fig 5. −log10(p) of genes in REACTOME_PRE_NOTCH_TRANSCRIPTION_AND_TRANSLATION.
The Wilcoxon rank-sum test is used to compute the p-values of each gene.
Fig 6. The multiplicative factor P(A | B)/P(A) is the increased probability that gene A is differentially expressed (at p-value = .05 or less) given the differential expression of gene B (at p-value = .05 or less) for REACTOME_PRE_NOTCH_TRANSCRIPTION_AND_TRANSLATION.
Fig 7. Relative frequency distribution of p-values in GSE13204 when comparing 174 T-ALL with 74 normal patients using the t-test.
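The ratio P(A | B)/P(A) in this caption is the quantity often called lift. A minimal sketch follows, assuming the probabilities are estimated as frequencies over repeated resamples of the patients; this excerpt does not say how they were actually obtained, and `lift` and the `indicators` layout are hypothetical names used only for illustration.

```python
def lift(indicators, a, b):
    """Estimate P(A | B) / P(A) from rows of per-gene significance flags.

    indicators: list of rows, one row per resample (assumed setup); each row
    holds one truthy/falsy flag per gene, set when that gene reached
    p <= .05 in that resample.
    a, b: column indices of gene A and gene B.
    """
    n = len(indicators)
    n_a = sum(1 for row in indicators if row[a])
    n_b = sum(1 for row in indicators if row[b])
    n_ab = sum(1 for row in indicators if row[a] and row[b])
    if n == 0 or n_a == 0 or n_b == 0:
        return float("nan")  # ratio undefined without any A or B events
    p_a = n_a / n
    p_a_given_b = n_ab / n_b
    return p_a_given_b / p_a


# Toy data: in `linked`, gene 0 is significant exactly when gene 1 is.
linked = [(1, 1), (1, 1), (0, 0), (0, 0)]
unlinked = [(1, 1), (1, 0), (0, 1), (0, 0)]
print(lift(linked, 0, 1), lift(unlinked, 0, 1))
```

A value near 1 indicates that knowing gene B is differentially expressed says little about gene A, while values well above 1 point to statistically linked genes.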
Note the high percentage of genes with very low p-values.
Gene sets that are differentially expressed (T-ALL versus Healthy) in Gene Expression Omnibus Accession GSE13204 at a False Discovery Rate of 1 × 10^-70.
Individual genes within each set can be found at software.broadinstitute.org/gsea/msigdb [2]. Individual gene p-values are computed with the t-test. Descriptions of the gene sets in Table 8 are taken from Subramanian et al. [2]:

1. “Genes down-regulated in response to ultraviolet (UV) radiation”
2. “Genes involved in Signalling by NGF”
3. “Cell-matrix adhesions play essential roles in important biological processes including cell motility, cell proliferation, cell differentiation, regulation of gene expression and cell survival. At the cell-extracellular matrix contact points, specialized structures are formed and termed focal adhesions, where bundles of actin filaments are anchored to transmembrane receptors of the integrin family through a multi-molecular complex of junctional plaque proteins.”
4. “Genes down-regulated by KRAS activation”
5. “Regulation of actin cytoskeleton”
6. “Genes up-regulated in response to low oxygen levels (hypoxia)”
7. “Endocytosis is a mechanism for cells to remove ligands, nutrients, and plasma membrane (PM) proteins, and lipids from the cell surface, bringing them into the cell interior.”
8. “Genes encoding components of apical junction complex”

| | GENE SET | DATABASE | Number of genes |
|---|---|---|---|
| 1 | HALLMARK_UV_RESPONSE_DN | MSigDB Hallmark | 144 (142) |
| 2 | REACTOME_SIGNALLING_BY_NGF | Reactome | 217 (211) |
| 3 | KEGG_FOCAL_ADHESION | KEGG pathway | 201 (197) |
| 4 | HALLMARK_KRAS_SIGNALING_DN | MSigDB Hallmark | 200 (199) |
| 5 | KEGG_REGULATION_OF_ACTIN_CYTOSKELETON | KEGG pathway | 216 (209) |
| 6 | HALLMARK_HYPOXIA | MSigDB Hallmark | 200 (200) |
| 7 | KEGG_ENDOCYTOSIS | KEGG pathway | 183 (179) |
| 8 | HALLMARK_APICAL_JUNCTION | MSigDB Hallmark | 200 (200) |
Gene sets that are differentially expressed (T-ALL versus AML) in Gene Expression Omnibus Accession GSE36133 at a False Discovery Rate of 3 × 10^-20.
Individual genes within each set can be found at software.broadinstitute.org/gsea/msigdb [2]. Individual gene p-values are computed with the Wilcoxon rank-sum test. Descriptions of the gene sets in Table 9 are taken from Subramanian et al. [2]:

1. “Genes encoding cell cycle related targets of E2F transcription factors”
2. “Genes involved in the G2/M checkpoint, as in progression through the cell division cycle”
3. “Genes important for mitotic spindle assembly”
4. “Genes involved in DNA Replication”
5. “Genes involved in Mitotic M-M/G1 phases”
6. “Genes up-regulated during transplant rejection.”
7. “Genes encoding components of the complement system, which is part of the innate immune system”
8. “Genes involved in Signalling by NGF (nerve growth factor)”
9. “Genes up-regulated by STAT5 in response to IL2 (Interleukin 2) stimulation”
10. “Genes regulated by NF-kB in response to TNF (Tumor Necrosis Factor) [GeneID = 7124]”
11. “Genes mediating programmed cell death (apoptosis) by activation of caspases”

| | GENE SET | DATABASE | Number of genes |
|---|---|---|---|
| 1 | HALLMARK_E2F_TARGETS | MSigDB Hallmark | 200 (190) |
| 2 | HALLMARK_G2M_CHECKPOINT | MSigDB Hallmark | 200 (195) |
| 3 | HALLMARK_MITOTIC_SPINDLE | MSigDB Hallmark | 200 (198) |
| 4 | REACTOME_DNA_REPLICATION | Reactome | 192 (178) |
| 5 | REACTOME_MITOTIC_M_M_G1_PHASES | Reactome | 172 (158) |
| 6 | HALLMARK_ALLOGRAFT_REJECTION | MSigDB Hallmark | 200 (196) |
| 7 | HALLMARK_COMPLEMENT | MSigDB Hallmark | 200 (195) |
| 8 | REACTOME_SIGNALLING_BY_NGF | Reactome | 217 (211) |
| 9 | HALLMARK_IL2_STAT5_SIGNALING | MSigDB Hallmark | 200 (194) |
| 10 | HALLMARK_TNFA_SIGNALING_VIA_NFKB | MSigDB Hallmark | 200 (197) |
| 11 | HALLMARK_APOPTOSIS | MSigDB Hallmark | 161 (153) |