| Literature DB >> 28184252 |
Xing Ren1, Qiang Hu2, Song Liu2, Jianmin Wang2, Jeffrey C Miecznikowski1.
Abstract
BACKGROUND: In gene set analysis, the researchers are interested in determining the gene sets that are significantly correlated with an outcome, e.g. disease status or treatment. With the rapid development of high throughput sequencing technologies, Ribonucleic acid sequencing (RNA-seq) has become an important alternative to traditional expression arrays in gene expression studies. Challenges exist in adopting the existent algorithms to RNA-seq data given the intrinsic difference of the technologies and data. In RNA-seq experiments, the measure of gene expression is correlated with gene length. This inherent correlation may cause bias in gene set analysis.Entities:
Keywords: Gene length bias; Gene set analysis; Maxmean statistic; RNA-seq; Restandardization
Year: 2017 PMID: 28184252 PMCID: PMC5294840 DOI: 10.1186/s13040-017-0125-9
Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522
Fig. 1Correlation of count and gene length. The scatterplot of gene length vs average counts (log-log scale) for genes in the LNCaP data set [4]. Darker color indicates higher scatter density. The red curve is a smooth spline with 7 degrees of freedom, showing a positive correlation between count and gene length
Fig. 2Gene length bias in RNA-seq data. The LNCaP data set [4] shows the probability of significant p-values (p<0.05) increases as gene length. Group 1 is the genes of shortest length and Group 10 is the genes with longest length
Type I error of simulation 1
|
|
| |||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| Set 1 | Set 5 | Set 10 | Set 15 | Set 20 | Set 1 | Set 5 | Set 10 | Set 15 | Set 20 | |
| Weighted | 0.007 | 0.007 | 0.006 | 0.009 | 0.017 | 0 | 0 | 0 | 0.001 | 0.001 |
| Unweighted | 0.002 | 0.007 | 0.008 | 0.014 | 0.022 | 0 | 0 | 0 | 0 | 0.003 |
| GOSeq | 0.019 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| CAMERA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Type I error of selected sets (p-value <0.05) under different β. All four methods maintain the type I error under the controlled level (0.05)
Power of simulation 1
| With length bias | No length bias | ||||||||
|---|---|---|---|---|---|---|---|---|---|
|
|
|
| |||||||
|
| 6 | 8 | 10 | 6 | 8 | 10 | 6 | 8 | 10 |
| Weighted | 0.36 | 0.48 | 0.77 | 0.81 | 0.95 | 0.98 | 0.61 | 0.93 | 0.96 |
| Unweighted | 0.29 | 0.45 | 0.75 | 0.77 | 0.90 | 0.96 | 0.61 | 0.94 | 0.97 |
| GOSeq | 0.23 | 0.38 | 0.56 | 0.97 | 1.0 | 1.0 | 0.96 | 1.0 | 1.0 |
| CAMERA | 0.01 | 0 | 0 | 0 | 0 | 0 | 0.02 | 0.01 | 0.02 |
Power of identifying set 1 as DE (p-value <0.05) under different β and length bias. The weighted SeqGSA increases the power of the unweighted procedure for detecting set 1. SeqGSA is more powerful with small DE effect (β=0.15), while GOSeq is more powerful with large DE effect (β=0.3). The two procedures of SeqGSA perform similarly under no length bias
Power of simulation 2
| Proportion of DE genes | 0.4 | 0.5 | 0.6 | 0.7 |
|---|---|---|---|---|
| Weighted | 0.22 | 0.53 | 0.71 | 0.87 |
| Unweighted | 0.16 | 0.41 | 0.67 | 0.73 |
| CAMERA | 0.09 | 0.19 | 0.26 | 0.39 |
Power of identifying DE sets (with correlation) in group 1 (p-value <0.05). The DE sets are constructed with a high proportion of DE genes in group 1
Power of simulation 3
| Proportion of DE genes | 0.4 | 0.5 | 0.6 | 0.7 |
|---|---|---|---|---|
| Weighted | 0.45 | 0.75 | 0.90 | 0.95 |
| Unweighted | 0.49 | 0.76 | 0.88 | 0.95 |
Power of identifying DE sets (with correlation) in group 3 and 4 (p-value <0.05)
Fig. 3Proportion of DE of GO categories for unweighted and weighted SeqGSA. Genes in the LNCaP data set in [4] are mapped to the GO categories. The 1585 mapped GO categories are grouped by median gene length from low to high. For the unweighted method (left), the correlation of the proportion of DE categories and gene length is on the borderline of significance (p=0.068). For the weighted method (right), the correlation is not significant (p=0.117)
Fig. 4Heatmap of hsa04810 KEGG pathway. The heatmap of the 94 mapped genes in the hsa04810 KEGG pathway for the CEPH HapMap data [41]. The count data is log-transformed and row-scaled
Fig. 5Standardized maxmean and permutations of hsa04810 pathway. The standardized maxmean statistic and permutations of the hsa04810 pathway for the CEPH HapMap data [41]. The standardardized maxmean (red) by the weighted method falls on the right of the unweighted S ∗ (black), while the permutation statistics of the two methods overlap