David G Robinson1, John D Storey2. 1. Lewis-Sigler Institute for Integrative Genomics and Department of Molecular Biology, Princeton University, Princeton, NJ 08544, USA. 2. Lewis-Sigler Institute for Integrative Genomics and Department of Molecular Biology, Princeton University, Princeton, NJ 08544, USA Lewis-Sigler Institute for Integrative Genomics and Department of Molecular Biology, Princeton University, Princeton, NJ 08544, USA.
Abstract
MOTIVATION: Next-generation sequencing experiments, such as RNA-Seq, play an increasingly important role in biological research. One complication is that the power and accuracy of such experiments depend substantially on the number of reads sequenced, so it is important and challenging to determine the optimal read depth for an experiment or to verify whether one has adequate depth in an existing experiment. RESULTS: By randomly sampling lower depths from a sequencing experiment and determining where the saturation of power and accuracy occurs, one can determine what the most useful depth should be for future experiments, and furthermore, confirm whether an existing experiment had sufficient depth to justify its conclusions. We introduce the subSeq R package, which uses a novel efficient approach to perform this subsampling and to calculate informative metrics at each depth. AVAILABILITY AND IMPLEMENTATION: The subSeq R package is available at http://github.com/StoreyLab/subSeq/.
MOTIVATION: Next-generation sequencing experiments, such as RNA-Seq, play an increasingly important role in biological research. One complication is that the power and accuracy of such experiments depend substantially on the number of reads sequenced, so it is important and challenging to determine the optimal read depth for an experiment or to verify whether one has adequate depth in an existing experiment. RESULTS: By randomly sampling lower depths from a sequencing experiment and determining where the saturation of power and accuracy occurs, one can determine what the most useful depth should be for future experiments, and furthermore, confirm whether an existing experiment had sufficient depth to justify its conclusions. We introduce the subSeq R package, which uses a novel efficient approach to perform this subsampling and to calculate informative metrics at each depth. AVAILABILITY AND IMPLEMENTATION: The subSeq R package is available at http://github.com/StoreyLab/subSeq/.
Many next-generation sequencing technologies have been developed to answer important biological questions. One property these technologies have in common is that they depend on read depth or coverage: increasing the number of reads typically increases the power and accuracy. For instance, in RNA-Seq greater read depth is known to increase the power of differential expression testing and the accuracy of expression estimates (Liu ; Tarazona ). The advent of multiplexed sequencing means that researchers should consider their read depth as a trade-off against cost and replication when designing experiments (Liu ), which means knowing the relationship between read depth and power is essential to designing sequencing experiments. Similarly, many researchers need to demonstrate that they have adequate depth in an existing experiment to support their biological conclusions.One valuable approach that multiple studies have used is to randomly subsample reads (sometimes called downsampling) and perform an identical analysis on each subsample. This is in contrast to methods that fit a parametric model to calculate power, such as Scotty (Busby ). By determining where metrics of power and accuracy ‘saturate’ with increasing depth, one can both determine recommendations for future experiments and demonstrate whether an existing experiment has sufficient depth. Studies have used random subsampling to propose guidelines for future experiments (Black ; Liu ), to perform a survey of different RNA-Seq analysis methods at varying read depths (Labaj ; Liu ; Rapaport ), or to demonstrate that they had achieved adequate read depth (Daines ; Toung ; Wang ). However, all took the approach of randomly subsampling from either the fastq or alignment file, and then reperforming the analysis, including the computationally intensive step of matching reads to genes, on each file. This process is slow, demanding of disk space, and requires possessing the original reads or mappings, which limits the number of subsamples that can be performed and the ease of performing this analysis on existing experiments.We introduce the subSeq R package, which instead subsamples sequencing reads with binomial sampling after they have been matched to genes and assembled into a count matrix. Because the step of matching reads to genes is independent and deterministic, this approach is functionally identical to the common approach of subsampling the read alignment files, but requires only the count matrix rather than the read alignment file. It also takes negligible time and computing resources even on large datasets, as the steps downstream of the read subsampling are much faster than the upstream steps. A similar approach is used to generate saturation figures in the NOISeq package (Tarazona ), but subSeq is designed to be used with any RNA-Seq analysis method. subSeq could be performed immediately on any experiment in the ReCount resource of analysis-ready datasets (Frazee ), and on any RNA-Seq experiment that provides a matrix of read counts per gene. An early version of this software was used in Robinson , on Bar-Seq measurements of the yeast deletion set, to determine the effect of read depth on detection of differential abundance.subSeq also streamlines the process of performing a differential expression analysis on each subsample, and of calculating relevant biological metrics for each to determine how they vary depending on read depth. In particular, subSeq reports metrics representing (i) the power to detect differential expression or abundance, (ii) the accuracy of effect size estimation and (iii) the estimated rate of false discoveries relative to the full experiment.
2 METHODS
The user provides an unnormalized M × N matrix X of read counts, where each row represents one of M genes, each column represents one of N samples and each value denotes the number of reads aligned to each gene within each sample. The user also specifies a vector of K subsampling proportions , each in the interval (0, 1], and the number of replications to perform at each proportion. For each p, a subsampled matrix is generated such that for and . This is equivalent to allowing each original mapped read to have probability p of being included in the new counts, as done, for example, by the Picard DownsampleSam function.For each subsample, we perform the same analysis that is performed on the full set of reads. Multiple approaches for the determination of RNA-Seq differential expression from a matrix of counts, including edgeR (Robinson ) and DESeq2 (Love ), are built into subSeq, as is DEXSeq for differential exon usage detection (Anders ). The user can also provide a custom method to be applied to each subsample.Here we use subSeq to examine the effect of depth on the RNA-Seq dataset from Hammer , testing for differential expression between rats with induced chronic neuropathic pain and a control group. The mapped read counts were downloaded from ReCount, only samples from the 2-month time point were used, and genes with fewer than five mapped reads were filtered out. We subsampled 11 proportions on a logarithmic scale from 0.01 to 1, performing five replications at each proportion.
3 RESULTS
As an illustrative example, we show the results of subsampling of an RNA-Seq dataset from Hammer , using edgeR or DESeq2 to normalize and test each subsample for differential expression. To perform these subsamples manually, it would have required downloading 11.4 Gb of reads, mapping them to the mouse genome, downsampling to produce an additional 95Gb of alignments, matching each read to the gene annotations and only then performing the differential expression analysis. Using subSeq, the subsampling requires only the 4.9 Mb matrix from the ReCount database, can be performed entirely in memory in R and takes a negligible amount of time (<1 s to perform the 55 subsamplings, ∼2–8 minutes to perform the analysis at each step, depending on the method chosen).After constructing subsamples and performing an analysis on each, subSeq calculates and visualizes summary metrics about each sequencing depth (Fig. 1); these plots aid in determining saturation of depth (Supplementary Fig. S1). As the plots show how read depth changes the conclusions of the analysis, the ‘oracle’ is defined as the P-values and estimates at the full depth. To estimate the power, subSeq determines the number of genes found significant at a given false discovery rate. To determine whether the decrease in read depth affects specificity, we also estimate the false discovery proportion (FDP) at each depth. subSeq does this by using the qvalue package to estimate the local false discovery rate for each gene in the oracle, then calculating the average of the oracle local FDR values among the genes found significant at each depth. To determine how depth affects the accuracy of effect size estimation, subSeq compares the log fold-changes estimated at each depth with the oracle estimates, reporting the mean-squared error and the Pearson and Spearman correlations.
Fig. 1.
The default plot generated by subSeq on subsamples of Hammer . This shows the number of significant genes at each depth (top left), the estimated FDP (top right) and the Spearman correlation (bottom left) and mean-squared error (bottom right) comparing the estimates at each depth with the full experiment
The default plot generated by subSeq on subsamples of Hammer . This shows the number of significant genes at each depth (top left), the estimated FDP (top right) and the Spearman correlation (bottom left) and mean-squared error (bottom right) comparing the estimates at each depth with the full experimentsubSeq is designed to allow any analysis to be performed on each subsample. While the example demonstrated here used RNA-Seq data, subSeq works equally well on other genomic approaches such as Bar-Seq or Tn-Seq, as demonstrated in Robinson .
Authors: Paul Hammer; Michaela S Banck; Ronny Amberg; Cheng Wang; Gabriele Petznick; Shujun Luo; Irina Khrebtukova; Gary P Schroth; Peter Beyerlein; Andreas S Beutler Journal: Genome Res Date: 2010-05-07 Impact factor: 9.043
Authors: Michele A Busby; Chip Stewart; Chase A Miller; Krzysztof R Grzeda; Gabor T Marth Journal: Bioinformatics Date: 2013-01-12 Impact factor: 6.937
Authors: Yichuan Liu; Jane F Ferguson; Chenyi Xue; Ian M Silverman; Brian Gregory; Muredach P Reilly; Mingyao Li Journal: PLoS One Date: 2013-06-24 Impact factor: 3.240
Authors: Paweł P Łabaj; Germán G Leparc; Bryan E Linggi; Lye Meng Markillie; H Steven Wiley; David P Kreil Journal: Bioinformatics Date: 2011-07-01 Impact factor: 6.937
Authors: Feng Wang; JongDae Shin; Jeremy M Shea; Jun Yu; Ana Bošković; Meg Byron; Xiaochun Zhu; Alex K Shalek; Aviv Regev; Jeanne B Lawrence; Eduardo M Torres; Lihua J Zhu; Oliver J Rando; Ingolf Bach Journal: Elife Date: 2016-09-19 Impact factor: 8.140
Authors: Ana Conesa; Pedro Madrigal; Sonia Tarazona; David Gomez-Cabrero; Alejandra Cervera; Andrew McPherson; Michał Wojciech Szcześniak; Daniel J Gaffney; Laura L Elo; Xuegong Zhang; Ali Mortazavi Journal: Genome Biol Date: 2016-01-26 Impact factor: 13.583