Shu-Ju Lin1, Tzu-Pin Lu1,2, Qi-You Yu1, Chuhsing Kate Hsiao3,4.
Abstract
BACKGROUND: Current methods for gene-set or pathway analysis are usually designed to test the enrichment of a single gene-set. Once the analysis has been carried out for each set under study, a list of significant sets can be obtained. However, no quantitative measure is available for further prioritizing these sets by importance or strength of association. Ranking the pathways by the magnitude of the p-value may not be appropriate, because a p-value is not a measure of the strength of association. In addition, when each pathway is tested, these analyses are often implicitly affected by the number of differentially expressed genes in the set and/or by the dependence among genes.
Keywords: Association study; Bayesian logistic regression; Competing pathways; Differentially expressed genes; Gene-set analysis; Pathway score; Pathway ranking
Year: 2018 PMID: 30355338 PMCID: PMC6201593 DOI: 10.1186/s12859-018-2411-z
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Type I error rates under different settings when the gene-gene correlation ranges from 0 (independence) to mild correlation (ρ=0.3), and when data were generated from GSE48091 to preserve the correlation in real application
| Method | 50 genes per set | | | | 100 genes per set | | | |
|---|---|---|---|---|---|---|---|---|
| | | | | GSE48091 | | | | GSE48091 |
| Bayesian | 0.023 | 0.022 | 0.022 | 0.035 | 0.024 | 0.022 | 0.022 | 0.025 |
| Logistic (ps) | 0.043 | 0.045 | 0.046 | 0.066 | 0.041 | 0.040 | 0.041 | 0.060 |
| Logistic (sum) | 0.040 | 0.040 | 0.040 | 0.160 | 0.036 | 0.036 | 0.036 | 0.041 |
| GSEA | 0.047 | 0.047 | 0.048 | 0.069 | 0.039 | 0.046 | 0.045 | 0.042 |
| Global | 0.031 | 0.030 | 0.033 | 0.131 | 0.047 | 0.046 | 0.050 | 0.159 |
| ORA | 0.096 | 0.094 | 0.091 | 0.069 | 0.057 | 0.058 | 0.060 | 0.085 |
| Fisher’s | 0.048 | 0.048 | 0.052 | 0.163 | 0.049 | 0.049 | 0.049 | 0.201 |
The size of each set is either 50 or 100. The p-values under Global and Fisher’s are derived based on 1000 permutations
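The two quantities behind this table, a permutation p-value and an empirical type I error rate, can be illustrated with a minimal sketch. The test statistic used here (absolute difference in group means) and the function names are hypothetical stand-ins chosen for illustration; they are not the Global test or Fisher's statistics used in the paper.

```python
import random
import statistics


def permutation_p_value(values, labels, n_perm=1000, rng=None):
    """Two-sided permutation p-value for a difference in group means.

    The statistic is an illustrative assumption, not the paper's statistic.
    labels must contain both 0 (control) and 1 (case).
    """
    rng = rng or random.Random(0)

    def mean_diff(lab):
        cases = [v for v, l in zip(values, lab) if l == 1]
        controls = [v for v, l in zip(values, lab) if l == 0]
        return abs(statistics.fmean(cases) - statistics.fmean(controls))

    observed = mean_diff(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between values and labels
        if mean_diff(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


def empirical_type_i_error(p_values, alpha=0.05):
    """Fraction of null replicates rejected at level alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)


if __name__ == "__main__":
    # Simulate null data (no group difference) and estimate the error rate.
    rng = random.Random(42)
    labels = [0] * 10 + [1] * 10
    null_p = []
    for _ in range(200):
        values = [rng.gauss(0.0, 1.0) for _ in labels]
        null_p.append(permutation_p_value(values, labels, n_perm=200, rng=rng))
    print(empirical_type_i_error(null_p))
```

With a well-calibrated test, the printed rate should sit near the nominal level of 0.05; values well above it, like some of the GSE48091 columns in the table, indicate inflation.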
Fig. 1 Values of the pathway coefficients in the five simulation settings (I, II, III, IV, and V). The set size is the number of genes in the corresponding set, where the number in parentheses corresponds to genes in the pathway as well as genes in the subsets of the pathway
Fig. 2 Performance evaluation. (a) The accuracy of selecting the correct top ranking pathway under simulation settings I-V. (b) The accuracy of selecting the correct top two ranking pathways under simulation settings I-V. (c) The accuracy of selecting the correct top ranking pathway under simulation settings VI-IX. (d) The accuracy of selecting the correct top two ranking pathways under simulation settings VI-IX. P-values of Global test and Fisher’s method are derived based on asymptotic approximations
Fig. 3 Performance evaluation. The number is the percentage of detected association of each individual gene-set under simulation settings I and VI. P-values of Global test and Fisher’s method are derived based on asymptotic approximations
Fig. 4 (a) The heatmap of expression counts of genes in the Jak-STAT signaling pathway. (b) The corresponding ranks of expression counts in the same pathway. (c) The summation of expression counts for each sample. (d) The proposed pathway score for each sample
P-values or P( of each pathway under different methods
| | p53 | estrogen | Jak-STAT | mTOR | oocyte meiosis | taste transduction |
|---|---|---|---|---|---|---|
| Size | 68 (290) | 99 (838) | 158 (1039) | 60 (433) | 124 (499) | 83 (247) |
| Bayesian | 0.717 | 0.736 | | 0.722 | 0.611 | |
| Logistic (sum) | 0.386 | | 0.009 | | 0.038 | |
| GSEA | 0.264 | 0.134 | 0.306 | | | 0.228 |
| SPIA | | 0.396 | 0.266 | | 0.222 | 0.983 |
| Global | <1e-21 | <1e-17 | <1e-22 | | <1e-16 | |
| ORA | | 0.101 | 0.275 | | 0.086 | 0.083 |
| Fisher’s | <1e-214 | | <1e-311 | <1e-314 | | |
Numbers underlined and in boldface indicate the most influential (top-ranked) pathway under each test, while numbers in boldface only indicate the least influential pathway. The second row lists the number of genes in each pathway, where the number in parentheses includes the genes in sub-pathways. The p-values under Global and Fisher’s are asymptotic approximations
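The extremely small Fisher’s entries are what one would expect if "Fisher’s" here denotes Fisher's method for combining gene-level p-values; that reading is an assumption, since this section does not define the test. Under that assumption, the sketch below shows the asymptotic calculation: the statistic −2 Σ ln p_i is referred to a chi-square distribution with 2k degrees of freedom, which is valid when the k p-values are independent. The function name is hypothetical.

```python
import math


def fisher_combined_p(p_values):
    """Fisher's method: combine k independent p-values into one.

    Returns (statistic, combined p-value). The chi-square survival
    function with an even number of degrees of freedom (2k) has the
    closed form exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    """
    k = len(p_values)
    stat = -2.0 * sum(math.log(p) for p in p_values)
    half = stat / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, k):
        term *= half / i  # builds (x/2)^i / i! incrementally
        total += term
    return stat, math.exp(-half) * total


if __name__ == "__main__":
    # Many moderately small gene-level p-values give a tiny combined value.
    stat, p = fisher_combined_p([0.01] * 60)
    print(stat, p)
```

Because the combined p-value shrinks rapidly with the number of genes, and because correlated genes violate the independence assumption, such values are hard to compare across pathways of different sizes.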
Fig. 5 Boxplots of posterior samples of each pathway coefficient under the Bayesian model
Probability P(, mean of the regression coefficient β, and whether the pathway is over- or under-expressed in the diseased group when different reference genes are considered in the JAK-STAT pathway for the breast NGS study
| Reference Gene | | | | | | |
|---|---|---|---|---|---|---|
| Gene symbol | 3952 | 1271 | 2690 | 3575 | 53833 | 51561 |
| Probability | 0.889 | 0.886 | 0.880 | 0.873 | 0.844 | 0.812 |
| Mean | 9.05 | 9.03 | 8.54 | 6.71 | −8.23 | −7.68 |
| Over/under | over | over | over | over | under | under |
Fig. 6 The relationship between the percentage of DE genes and the negative base-10 logarithm of the p-value under the ORA test (left, for 289 pathways in KEGG) and the SPIA test (right, for 130 signaling pathways in KEGG). The linear correlation is 0.80 in the left panel and 0.49 in the right panel
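The dependence shown in Fig. 6 can be made concrete with a minimal sketch, assuming ORA is computed as the usual one-sided hypergeometric test (the section does not state the exact implementation). The function names and the toy pathway counts are hypothetical and are not taken from the study.

```python
import math


def ora_p_value(n_universe, n_de, set_size, n_de_in_set):
    """Upper-tail hypergeometric probability P(X >= n_de_in_set).

    n_universe: total genes tested; n_de: DE genes among them;
    set_size: genes in the pathway; n_de_in_set: DE genes in the pathway.
    """
    upper = min(n_de, set_size)
    total = math.comb(n_universe, set_size)
    tail = sum(
        math.comb(n_de, x) * math.comb(n_universe - n_de, set_size - x)
        for x in range(n_de_in_set, upper + 1)
    )
    return tail / total


def pearson(xs, ys):
    """Pearson linear correlation coefficient of two equal-length lists."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


if __name__ == "__main__":
    # Toy pathways: (set size, DE genes in set); universe values are made up.
    n_universe, n_de = 10000, 1000
    pathways = [(50, 5), (50, 10), (100, 15), (100, 30), (80, 20)]
    pct_de = [100.0 * k / m for m, k in pathways]
    neg_log_p = [
        -math.log10(ora_p_value(n_universe, n_de, m, k)) for m, k in pathways
    ]
    print(pearson(pct_de, neg_log_p))
```

In this setup the ORA p-value is driven almost entirely by the share of DE genes in the set, which is the kind of dependence the left panel of the figure reports.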