Danni Yu, Wolfgang Huber, Olga Vitek.
Abstract
MOTIVATION: RNA-seq experiments produce digital counts of reads that are affected by both biological and technical variation. To distinguish the systematic changes in expression between conditions from noise, the counts are frequently modeled by the Negative Binomial distribution. However, in experiments with small sample size, the per-gene estimates of the dispersion parameter are unreliable.
Year: 2013 PMID: 23589650 PMCID: PMC3654711 DOI: 10.1093/bioinformatics/btt143
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Existing and proposed approaches for differential analysis of RNA-seq experiments with two conditions
| Method | Probability model | Estimation of dispersion | Testing | | Time |
|---|---|---|---|---|---|
| (a) sSeq (proposed) (this manuscript) | | | | Yes | min |
| (b) edgeR (Robinson and Smyth, 2008) | | | | Yes* | min |
| (c) DESeq ( | | | | Yes | min |
| (d) baySeq ( | | | | Yes | h |
| (e) BBSeq ( | | | | Yes | h |
| (f) SAMseq ( | Non-parametric | | | No | min |
(a) is the size factor for sample j in condition i as defined in (Anders and Huber, 2010). is the expected normalized expression of gene g for a sample in condition i. is the per-gene dispersion estimate using the method of moments in Equation (6).
(b) is the ‘effective’ library size. is the probability that a read in i maps to gene g. *Up to v2.4.6.
(c) is gene- and condition-specific dispersion. and can be estimated by the method of moments or by the Cox-Reid corrected Maximum Likelihood.
(d) is the size of the library i from condition j. is as in (b).
(e) is as in (b). is as in (d). is the coefficient of the linear predictor associated with an indicator Z of conditions.
Column ‘Time’ is the run time for the experimental datasets in Section 4 on a laptop computer.
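Note (a) refers to a per-gene method-of-moments dispersion estimate. As a rough sketch of that idea, and not a reproduction of the paper's Equation (6), the function below assumes the common Negative Binomial parameterization Var = mu + phi * mu^2 and counts that have already been normalized, so size factors are ignored. The name `mom_dispersion` and the truncation of negative estimates at zero are assumptions made for illustration.

```python
import numpy as np

def mom_dispersion(counts):
    """Per-gene method-of-moments dispersion under Var = mu + phi * mu**2.

    counts: 2-D array-like, genes in rows, replicate samples in columns.
    Assumption: counts are already normalized (size factors are ignored).
    """
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean(axis=1)
    var = counts.var(axis=1, ddof=1)  # unbiased sample variance per gene
    # Solve var = mean + phi * mean**2 for phi; genes with zero mean get 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        phi = (var - mean) / mean**2
    phi = np.where(mean > 0, phi, 0.0)
    # Assumption: negative estimates (under-dispersion) are truncated at zero.
    return np.maximum(phi, 0.0)
```

With only two or three replicates per gene, the sample variance in this formula is itself very noisy, which is the unreliability that motivates shrinking the estimates.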
Fig. 1. Dispersion and variance estimation in Simulation1. Similar plots for other datasets are shown in Supplementary Materials. (a) ASD versus shrinkage target . ASD is maximized at (solid horizontal line). The dashed lines are the selected target and its ASD. (b) The proposed shrinkage estimator is a linear transformation of , with the slope and the fixed point . All are transformed to . (c, e and g) Dispersion estimates by sSeq, edgeR and DESeq versus the per-gene mean read counts across conditions. Gray smooth scatter are (same on all the plots). Black dots are estimated by each method. Gray lines indicate the true dispersion parameters. (d, f and h) Same as above, but for the variances of the read counts.
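Panel (b) describes the shrinkage estimator as a linear transformation of the per-gene estimates with a slope and a fixed point. A minimal sketch of such a map follows. The weight `delta` and the function name are hypothetical, and how the paper chooses the weight and the target is not shown in this record.

```python
import numpy as np

def shrink_dispersion(phi_mm, target, delta):
    """Shrink per-gene dispersion estimates toward a common target.

    phi_mm: per-gene method-of-moments estimates.
    target: shrinkage target, which is the fixed point of the map.
    delta:  hypothetical weight in [0, 1]; the slope of the map is 1 - delta.
    """
    phi_mm = np.asarray(phi_mm, dtype=float)
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie in [0, 1]")
    # Linear map: an estimate equal to the target is left unchanged,
    # and every other estimate moves toward the target by a factor delta.
    return delta * target + (1.0 - delta) * phi_mm
```

Setting `delta = 0` keeps the raw per-gene estimates, while `delta = 1` replaces every estimate with the target.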
Areas under the ROC curves of detecting differentially expressed genes for the datasets with an external ‘gold standard’ while varying the FDR-adjusted P-value or posterior probability cutoff
| Methods | Simulation1 | Simulation2 | Simulation3 | MAQC Project | Griffith | |||||
|---|---|---|---|---|---|---|---|---|---|---|
| Proposed | sSeq | 0.947 | 0.962 | 0.951 | 0.967 | 0.856 | 0.888 | 0.585 | 0.911 | 0.689 |
| Existing | edgeR | 0.918 | 0.948 | 0.938 | 0.951 | 0.840 | 0.833 | 0.558 | 0.850 | 0.557 |
| | DESeq | 0.932 | 0.940 | 0.937 | 0.949 | 0.842 | 0.816 | 0.577 | 0.867 | 0.596 |
| | baySeq | 0.568 | 0.711 | 0.548 | 0.714 | 0.558 | 0.628 | 0.551 | 0.852 | 0.702 |
| | BBSeq | 0.675 | 0.672 | 0.669 | 0.674 | 0.578 | 0.619 | 0.560 | 0.617 | 0.544 |
| | SAMseq | 0.964 | 0.968 | 0.882 | 0.563 | | | | | |
Sub-columns are subsets of the data with one randomly selected replicate per condition and the full datasets. Values closer to 1 indicate higher sensitivity and specificity.
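Sweeping a cutoff over per-gene scores and integrating the resulting ROC curve gives the same number as the probability that a truly differentially expressed gene is scored above a non-differentially expressed one. The sketch below computes the area that way. The function name, and the choice of score (for example 1 minus the adjusted P-value), are assumptions for illustration and not the evaluation code used for the table.

```python
import numpy as np

def roc_auc(is_de, scores):
    """Area under the ROC curve via pairwise comparisons.

    is_de:  boolean gold-standard labels (True = differentially expressed).
    scores: per-gene scores where larger means stronger evidence,
            e.g. 1 - adjusted P-value or a posterior probability.
    """
    is_de = np.asarray(is_de, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[is_de], scores[~is_de]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one DE and one non-DE gene")
    # Compare every DE gene with every non-DE gene; ties count as one half.
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)
```

The pairwise matrix uses memory proportional to the number of DE genes times the number of non-DE genes, so for genome-scale data a rank-based formula would be the more practical choice.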
Fig. 2. The ECDF curves of detecting differential expression for the datasets with no external ‘gold standard’. Y-axis: ECDF, a function of the gene rank. X-axis: P-value or 1 minus posterior probability. Solid line: two randomly selected replicates from the same condition (AvsA). Dotted line: one randomly selected replicate from each condition (unreplicated AvsB). Dashed line: AvsB on the full dataset for two-group designs. Dashed-dotted line: AvsB on the full dataset for more complex designs. Gray line: 45-degree line. SAMseq is not applicable to unreplicated experiments and is excluded. The desired patterns are high areas under the AvsB curves, and AvsA curves that are at or below the 45-degree line.
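The AvsA criterion in the caption can be checked numerically: if a method is well calibrated on a same-condition comparison, the ECDF of its P-values should not rise above the 45-degree line. The sketch below evaluates an ECDF on a grid and reports its largest excess over the diagonal. Both function names and the default 101-point grid are assumptions for illustration.

```python
import numpy as np

def ecdf(pvalues, grid):
    """Fraction of P-values less than or equal to each grid point."""
    p = np.sort(np.asarray(pvalues, dtype=float))
    return np.searchsorted(p, grid, side="right") / p.size

def max_excess_over_diagonal(pvalues, grid=None):
    """Largest amount by which the ECDF exceeds the 45-degree line.

    A value at or below zero means the curve stays at or below the
    diagonal on the chosen grid, the pattern desired for AvsA comparisons.
    """
    if grid is None:
        grid = np.linspace(0.0, 1.0, 101)  # assumed evaluation grid
    grid = np.asarray(grid, dtype=float)
    return float(np.max(ecdf(pvalues, grid) - grid))
```

For an AvsB comparison the same `ecdf` values can be averaged over the grid to approximate the area under the curve, where larger areas indicate more genes with small P-values.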