Wolfgang Kaisers, Holger Schwender, Heiner Schaal.
Abstract
Merging data from multiple samples is required to detect lowly expressed transcripts or splicing events that might be present only in a subset of samples. However, the exact number of replicates required to detect such rare events often remains unclear, although it can be approached through probability theory. Here, we describe a probabilistic model relating the number of observed events in a batch of samples to observation probabilities. Therein, samples appear as a heterogeneous collection of events, each of which is observed with some probability. The model is evaluated in a batch of 54 transcriptomes of human dermal fibroblast samples. The majority of putative splice-sites (alignment gap-sites) are detected either in (almost) all samples or only sporadically, resulting in a U-shaped pattern for observation probabilities. The probabilistic model systematically underestimates event numbers due to a bias resulting from finite sampling. However, using an additional assumption, the probabilistic model can predict observed event numbers with a deviation of <10% from the median. Single samples contain a considerable number of uniquely observed putative splicing events (mean 7122 in alignments from TopHat and 86,215 in alignments from STAR). We conclude that the probabilistic model provides an adequate description for the observation of gap-sites in transcriptome data. Thus, required sample sizes can be calculated by applying a simple binomial model to sporadically observed random events. Due to the large number of uniquely observed putative splice-sites and the known stochastic noise in the splicing machinery, it appears advisable to include observation of rare splicing events into analysis objectives. Therefore, it is beneficial to take scores for the validation of gap-sites into account.
Keywords: RNA-seq; alternative splicing; splicing; transcriptome sequencing; wgis
Year: 2017 PMID: 28872584 PMCID: PMC5618549 DOI: 10.3390/ijms18091900
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1. Number of gap-sites observed in samples of different sizes (STAR).
Number of gap-sites from different sample sizes (STAR).
| nFiles | Total | ||
|---|---|---|---|
| 2 | 706 | 378 | 92 |
| 4 | 1076 | 666 | 105 |
| 8 | 1708 | 1179 | 124 |
| 12 | 2270 | 1659 | 137 |
Absolute number of gap-sites (in 1000).
Figure 2. Distribution of gap-site multiplicities in single samples: (a) Alignments from STAR; (b) Alignments from TopHat. For each gap-site, the multiplicity in the whole batch of 54 samples was determined. Then, for each of the 54 samples, the absolute number of multiplicities contained therein was tabled. Median gap-site numbers (dark gray) together with 25% and 75% quantiles (dashed lines) in 54 samples are shown. The light gray lines indicate the minimal and maximal numbers of gap-sites.
Figure 3. Observed and predicted numbers of gap-sites: (a) Alignments from STAR; (b) Alignments from TopHat. Observed numbers of gap-sites (y-axis) from 200 randomly drawn sub-batches with varying numbers of samples (x-axis) are shown as solid circles (dark gray). Results from a Loess regression are shown as a solid line (light gray). The predictions from the (uncorrected) raw model show a terminal slope that is too small (dotted line). The predictions from the completed model are more consistent with the observed numbers.
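The comparison behind Figure 3 can be approximated with a short Python sketch. It is an illustration only, not the authors' implementation: each sample is assumed to be a set of gap-site identifiers, the observation probability of a site is assumed to be estimated as its multiplicity divided by the batch size, and the function names and toy data are hypothetical. The additional assumption used in the completed model is not reproduced here.

```python
import random


def observed_union_sizes(samples, batch_size, n_draws=200, seed=0):
    """Count distinct gap-sites in randomly drawn sub-batches of samples."""
    rng = random.Random(seed)
    sizes = []
    for _ in range(n_draws):
        batch = rng.sample(samples, batch_size)  # draw without replacement
        sizes.append(len(set().union(*batch)))   # distinct sites in the sub-batch
    return sizes


def predicted_union_size(samples, batch_size):
    """Raw prediction: sum over sites of 1 - (1 - p_i)**batch_size.

    Assumption: p_i is estimated as (multiplicity of site i) / (number of samples).
    """
    n = len(samples)
    multiplicity = {}
    for sample in samples:
        for site in sample:
            multiplicity[site] = multiplicity.get(site, 0) + 1
    return sum(1 - (1 - m / n) ** batch_size for m in multiplicity.values())


# Hypothetical toy batch: four samples, gap-sites named by letters.
toy_samples = [{"a", "b"}, {"a", "c"}, {"a", "b", "d"}, {"a"}]
```

On the toy batch, the raw prediction for the full batch of four samples stays below the four distinct sites actually present, because sites seen in only one sample receive a small estimated probability. This resembles the finite-sampling underestimate described in the abstract, although the toy data say nothing about its size in real alignments.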
Sample size calculation.
| Observation probability | Sample size |
|---|---|
| 0.1 | 16 |
| 0.15 | 10 |
| 0.2 | 8 |
| 0.5 | 3 |
| 0.8 | 1 |
Required sample size for detection power of >80%.
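The table values are consistent with a simple binomial calculation: an event observed in each sample independently with probability p is detected at least once in n samples with probability 1 - (1 - p)^n. The Python sketch below assumes this reading, with the first column taken as p. The function name and the small numerical tolerance are illustrative choices, and the threshold is treated as "at least 80%", which is what reproduces the last row (p = 0.8, one sample).

```python
def required_sample_size(p, power=0.8, max_n=10_000):
    """Smallest n with 1 - (1 - p)**n >= power under a simple binomial model.

    p: per-sample observation probability of the event (0 < p <= 1).
    power: desired probability of seeing the event in at least one sample.
    """
    if not 0 < p <= 1:
        raise ValueError("p must lie in (0, 1]")
    miss = 1.0  # probability that the event is missed in all samples so far
    for n in range(1, max_n + 1):
        miss *= 1 - p
        # Small tolerance so that exact ties (p = 0.8, power = 0.8) count as reached.
        if 1 - miss >= power - 1e-12:
            return n
    raise ValueError("power not reached within max_n samples")


# Probabilities taken from the table above.
table = {p: required_sample_size(p) for p in (0.1, 0.15, 0.2, 0.5, 0.8)}
```

With these inputs, `table` maps 0.1, 0.15, 0.2, 0.5 and 0.8 to 16, 10, 8, 3 and 1 samples, matching the tabulated sample sizes.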