Kai Dong, Hongyu Zhao, Tiejun Tong, Xiang Wan.
Abstract
BACKGROUND: RNA-sequencing (RNA-Seq) has become a powerful technology for characterizing gene expression profiles because it is more accurate and comprehensive than microarrays. Although statistical methods developed for microarray data can be applied to RNA-Seq data, they are not ideal because RNA-Seq data are discrete counts. The Poisson and negative binomial distributions are commonly used to model count data. Recently, Witten (Annals Appl Stat 5:2493-2518, 2011) proposed a Poisson linear discriminant analysis for RNA-Seq data. The Poisson assumption may be less appropriate than the negative binomial distribution when biological replicates are available and the data are overdispersed (i.e., when the variance is larger than the mean). However, negative binomial variables are more complicated to model because they involve a dispersion parameter that needs to be estimated.
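The contrast between the two count models can be made concrete with a small sketch. It assumes a common mean-dispersion parameterization of the negative binomial, in which a count with mean `mu` and dispersion `phi` has variance `mu + phi * mu**2`; this parameterization and the function names are illustrative assumptions, not taken from the paper's own code.

```python
import math


def poisson_logpmf(x: int, mu: float) -> float:
    """Log-probability of count x under a Poisson distribution with mean mu."""
    return x * math.log(mu) - mu - math.lgamma(x + 1)


def nb_logpmf(x: int, mu: float, phi: float) -> float:
    """Log-probability of count x under a negative binomial distribution.

    Assumed parameterization: mean mu, dispersion phi > 0,
    variance mu + phi * mu**2 (size parameter r = 1 / phi).
    """
    r = 1.0 / phi
    return (
        math.lgamma(x + r) - math.lgamma(r) - math.lgamma(x + 1)
        + r * math.log(r / (r + mu))
        + x * math.log(mu / (r + mu))
    )


def nb_variance(mu: float, phi: float) -> float:
    """Variance of the negative binomial under the assumed parameterization."""
    return mu + phi * mu ** 2


if __name__ == "__main__":
    mu, phi = 10.0, 0.5
    # A Poisson count has variance equal to its mean; the negative binomial
    # variance exceeds the mean whenever phi > 0 (overdispersion).
    print("Poisson variance:", mu)
    print("NB variance:", nb_variance(mu, phi))
    # The heavier tail of the negative binomial shows up as a much larger
    # log-probability for a count far above the mean.
    print("log P(X=30), Poisson:", poisson_logpmf(30, mu))
    print("log P(X=30), NB:", nb_logpmf(30, mu, phi))
```

As `phi` approaches zero, the negative binomial log-probabilities approach the Poisson ones, which is why a Poisson-based classifier can be viewed as the limiting case with no overdispersion.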
Keywords: Linear discriminant analysis; Negative binomial distribution; RNA-Seq
Year: 2016 PMID: 27623864 PMCID: PMC5022247 DOI: 10.1186/s12859-016-1208-1
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Numerical comparisons between NBLDA and PLDA. The left panel shows the results with a common dispersion ϕ. The right panel shows the results with gene-specific dispersions ϕ, which are i.i.d. random variables from a chi-squared distribution with r degrees of freedom. We compute the discriminant scores of NBLDA and PLDA for different values of ϕ and r.
Fig. 2 Mean misclassification rates for all four methods with ϕ=20 and σ=5. The x-axis represents the proportion of differentially expressed genes; 20, 40, 60, 80 and 100 % are considered. These plots investigate the effect of the proportion of differentially expressed genes.
Fig. 3 Mean misclassification rates for all four methods with ϕ=20 and σ=5. “80 % DE” means that 80 % of the genes are differentially expressed, and likewise for “40 % DE”. This plot investigates the effect of the number of genes.
Fig. 4 Mean misclassification rates for all four methods with σ=5. “80 % DE” means that 80 % of the genes are differentially expressed, and likewise for “40 % DE”. This plot investigates the effect of overdispersion.
Fig. 5 Mean misclassification rates for real data sets.
Medians of the dispersions for the cervical cancer data and the HapMap data, where "G" represents the number of top genes selected by edgeR (version 3.3)
| Data sets | | | | |
|---|---|---|---|---|
| Cervical cancer | 21.2 | 23.3 | 18.2 | 11.0 |
| HapMap | 36.4 | 40.1 | 38.2 | 20.1 |