Kimon Froussios, Nick J Schurch, Katarzyna Mackinnon, Marek Gierliński, Céline Duc, Gordon G Simpson, Geoffrey J Barton.
Abstract
MOTIVATION: RNA-seq experiments are usually carried out in three or fewer replicates. In order to work well with so few samples, differential gene expression (DGE) tools typically assume the form of the underlying gene expression distribution. In this paper, the statistical properties of gene expression from RNA-seq are investigated in the complex eukaryote, Arabidopsis thaliana, extending and generalizing the results of previous work in the simple eukaryote Saccharomyces cerevisiae.
Year: 2019 PMID: 30726870 PMCID: PMC6748783 DOI: 10.1093/bioinformatics/btz089
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
RNA-seq DGE tools used in this study
| Name | Assumed distribution | Normalization | Description | Version | Citations |
|---|---|---|---|---|---|
|  | Negative binomial | Internal | Empirical Bayesian estimate of posterior likelihood | 2.4 | 259 |
|  | Binomial | None | Random sampling model using Fisher's exact test and the likelihood ratio test | 1.24.0 | 748 |
|  | Negative binomial | DESeq | Shrinkage variance | 1.22.0 | 4308 |
|  | Negative binomial | DESeq | Shrinkage variance | 1.10.0 | 4277 |
|  | Negative binomial | DESeq (median) | Empirical Bayesian estimate of posterior likelihood | 1.10.0 | 301 |
|  | Negative binomial | TMM | Empirical Bayes estimation and either an exact test analogous to Fisher’s exact test but adapted to over-dispersed data or a generalized linear model | 3.12 | 5339 |
|  | Log-normal | TMM | Generalized linear model | 3.26.2 | 2197 |
|  | Poisson log-linear model | Internal | Score statistic | 1.1.2 | 80 |
|  | None | Internal | Mann–Whitney test with Poisson resampling | 2.0 | 136 |
Note: A list of the DGE tools and their respective versions used in this study, together with their core methodology. The number of citations is shown as a proxy for each tool’s popularity.
Citations as reported by PubMed Central: number of articles that reference the listed source on January 28, 2019.
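The normalization column above lists a DESeq-style median scheme for several tools. As a minimal sketch of the commonly described median-of-ratios idea (not code from this study or from any of the listed packages), each sample's size factor can be taken as the median, over genes, of the ratio between that sample's count and the gene's geometric mean across samples. The function name and the toy counts are hypothetical.

```python
import math
from statistics import median

def median_of_ratios_size_factors(counts):
    """counts: list of genes, each a list of raw counts (one per sample).

    Illustrative median-of-ratios size factors; an assumption-laden sketch,
    not the implementation used by any specific DGE package.
    """
    n_samples = len(counts[0])
    # Keep only genes with non-zero counts in every sample so the log is defined.
    usable = [row for row in counts if all(c > 0 for c in row)]
    # Log of the per-gene geometric mean across samples.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - g for row, g in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Hypothetical toy data: the second sample was sequenced exactly twice as deeply.
toy_counts = [[10, 20], [100, 200], [5, 10], [0, 3]]
size_factors = median_of_ratios_size_factors(toy_counts)
```

Dividing each sample's counts by its size factor puts the samples on a common scale; in the toy data the two factors differ by a factor of two, as expected.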
Fig. 1. Pairwise inter-replicate Pearson’s correlation of gene expression. The black grid lines indicate the grouping of the replicates with regard to the three experiments. (A) Correlation matrix of gene expression for all 17 replicates. Apart from replicate 11, all replicates correlate very well. (B) Same as (A), but with replicate 11 filtered out, so that the patterns of correlation among the remaining 16 replicates can be seen more clearly.
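A correlation matrix of this kind can be sketched in a few lines. The example below is only an illustration under stated assumptions: the function names and the toy expression vectors are hypothetical, and the third toy replicate is made deliberately discordant to mimic an outlier such as replicate 11.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(replicates):
    """replicates: list of per-replicate expression vectors (one value per gene)."""
    k = len(replicates)
    return [[pearson(replicates[i], replicates[j]) for j in range(k)]
            for i in range(k)]

# Hypothetical toy data: 3 replicates x 5 genes; the third one is discordant.
toy = [
    [10, 200, 35, 0, 80],
    [12, 190, 40, 1, 75],
    [90, 15, 300, 60, 5],
]
matrix = correlation_matrix(toy)
```

A replicate whose row of the matrix is uniformly low against all others is a candidate for exclusion, which is the pattern the figure shows for replicate 11.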
Fig. 2. Inter-replicate variation goodness-of-fit. Histograms of the probability that the genes’ fragment counts across replicates are compatible with each of the four specified distributions. The fraction of genes rejecting the distribution model is given above each plot. The Benjamini–Hochberg adjusted critical P-value is shown in red.
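The Benjamini–Hochberg critical P-value mentioned in the caption can be illustrated as follows. This is a sketch of the standard step-up procedure, not the authors' code; the significance level of 0.05 and the toy P-values are assumptions made for the example.

```python
def bh_critical_p(p_values, alpha=0.05):
    """Largest P-value rejected by the Benjamini-Hochberg procedure, or None.

    alpha is an assumed false discovery rate; the source does not state it.
    """
    m = len(p_values)
    critical = None
    for rank, p in enumerate(sorted(p_values), start=1):
        # Step-up rule: keep the largest p that satisfies p <= alpha * rank / m.
        if p <= alpha * rank / m:
            critical = p
    return critical

def fraction_rejected(p_values, alpha=0.05):
    """Fraction of tests (genes) whose P-value is at or below the critical value."""
    critical = bh_critical_p(p_values, alpha)
    if critical is None:
        return 0.0
    return sum(p <= critical for p in p_values) / len(p_values)

# Hypothetical goodness-of-fit P-values for ten genes.
toy_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
critical = bh_critical_p(toy_p)
rejected = fraction_rejected(toy_p)
```

In this toy case only the two smallest P-values pass the step-up rule, so the critical value is 0.008 and 20% of the genes reject the model.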
Fig. 3. Distribution histogram of gene expression. Each gene is represented by the mean of its read count estimates across replicates. The various levels of non-zero expression are shown in blue. The x-axis here is logarithmic, so genes with zero expression were added manually at an arbitrary but distinct location on the axis (red bar). The y-axis is square-root scaled.
Fraction of genes whose cross-replicate expression distribution rejects the null hypothesis for each of four distribution models
| Replicates | Poisson (%) | Normal (%) | Log-normal (%) | Neg. Binomial (%) |
|---|---|---|---|---|
| (i) | 70 | 23 | 2 | 0 |
| (ii) | 65 | 10 | 0 | 0 |
| (iii) | 59 | 9 | 0 | 0 |
Note: Cases: (i) all replicates 1–17, excluding the contaminated replicate 11 (see also Fig. 2), (ii) only the non-noisy replicates 1–7 and 15–17 and (iii) replicates 8–10 and 12–17 as control for statistical power.
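One way to see why the Poisson model is rejected for most genes while the negative binomial is not is to compare each gene's variance with its mean across replicates. The sketch below is not the goodness-of-fit test used in the study. It only computes the variance-to-mean ratio (about 1 under a Poisson model) and a method-of-moments estimate of the negative binomial dispersion, using the relation variance = mean + dispersion × mean². All names and counts are hypothetical.

```python
from statistics import mean, variance

def dispersion_summary(counts):
    """Return (variance-to-mean ratio, moment estimate of NB dispersion) for one gene."""
    mu = mean(counts)
    if mu == 0:
        return None  # unexpressed gene: the ratio is undefined
    var = variance(counts)  # sample variance (n - 1 denominator)
    index = var / mu  # close to 1 when counts are Poisson-like
    # Negative binomial: var = mu + alpha * mu^2, so alpha = (var - mu) / mu^2.
    nb_alpha = max(0.0, (var - mu) / mu ** 2)
    return index, nb_alpha

# Hypothetical per-replicate counts for two genes.
overdispersed = [95, 130, 80, 150, 110, 60, 175]
poisson_like = [7, 13, 10, 8, 12, 14, 6]

over_index, over_alpha = dispersion_summary(overdispersed)
pois_index, pois_alpha = dispersion_summary(poisson_like)
```

A ratio well above 1, as for the first toy gene, indicates over-dispersion that a Poisson model cannot absorb but a negative binomial model can.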
Fig. 4. False positive (FP) fractions in WT versus WT comparisons of DGE. A total of 100 bootstrap iterations were performed for each sample size in the range of 3 to 7 replicates per condition. The plots show the median (horizontal line), quartiles (shaded blue boxes), 95% data limits (capped vertical lines) and outliers (black points) for the fraction of bootstraps in which a gene was called as differentially expressed (without any fold-change threshold). Panels (A) and (B) differ in the range of the y-axis. DEGSeq displays poor FP performance (nearly 50% of its positives are false). The performance of the tools reflects their choice of methods and models (Table 1); the tools with the lowest FP fractions use the negative binomial or log-normal distributions.
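The WT versus WT bootstrap can be sketched as follows, under several assumptions. Replicates are drawn without replacement and split into two pseudo-conditions, every call is then a false positive by construction, and the fraction of genes called is recorded per iteration (one possible reading of the caption). The significance call is a crude placeholder, a Welch-type statistic on log-transformed counts with an arbitrary threshold, and is not any of the tools in Table 1. All names and data are hypothetical.

```python
import math
import random
from statistics import mean, variance

def toy_de_call(group_a, group_b, t_threshold=4.0):
    """Placeholder DGE call (NOT a tool from Table 1): Welch-type t on log2(count + 1)."""
    a = [math.log2(c + 1) for c in group_a]
    b = [math.log2(c + 1) for c in group_b]
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        return False  # no variation at all: never call the gene
    return abs(mean(a) - mean(b)) / se > t_threshold

def fp_fractions(counts, n_per_condition, iterations=100, seed=0, de_call=toy_de_call):
    """counts: dict mapping gene -> list of counts, one per WT replicate."""
    rng = random.Random(seed)
    n_reps = len(next(iter(counts.values())))
    fractions = []
    for _ in range(iterations):
        # Draw 2n distinct replicates and split them into two pseudo-conditions.
        picked = rng.sample(range(n_reps), 2 * n_per_condition)
        cond_a, cond_b = picked[:n_per_condition], picked[n_per_condition:]
        called = sum(
            de_call([row[i] for i in cond_a], [row[i] for i in cond_b])
            for row in counts.values()
        )
        fractions.append(called / len(counts))
    return fractions

# Hypothetical data: 20 genes x 16 WT replicates drawn from one distribution.
toy_rng = random.Random(1)
toy_counts = {f"gene{g}": [toy_rng.randint(50, 150) for _ in range(16)]
              for g in range(20)}
fractions = fp_fractions(toy_counts, n_per_condition=3)
```

Passing a different `de_call` function and repeating the run for each sample size from 3 to 7 would give the kind of per-tool distributions summarized by the box plots.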