subSeq: determining appropriate sequencing depth through efficient read subsampling.

Abstract

MOTIVATION: Next-generation sequencing experiments, such as RNA-Seq, play an increasingly important role in biological research. One complication is that the power and accuracy of such experiments depend substantially on the number of reads sequenced, so it is important and challenging to determine the optimal read depth for an experiment or to verify whether one has adequate depth in an existing experiment.
RESULTS: By randomly sampling lower depths from a sequencing experiment and determining where the saturation of power and accuracy occurs, one can determine what the most useful depth should be for future experiments, and furthermore, confirm whether an existing experiment had sufficient depth to justify its conclusions. We introduce the subSeq R package, which uses a novel efficient approach to perform this subsampling and to calculate informative metrics at each depth.
AVAILABILITY AND IMPLEMENTATION: The subSeq R package is available at http://github.com/StoreyLab/subSeq/.

Year: 2014 PMID: 25189781 PMCID: PMC4296149 DOI: 10.1093/bioinformatics/btu552

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

Many next-generation sequencing technologies have been developed to answer important biological questions. One property these technologies have in common is that they depend on read depth or coverage: increasing the number of reads typically increases power and accuracy. For instance, in RNA-Seq greater read depth is known to increase the power of differential expression testing and the accuracy of expression estimates (Liu et al.; Tarazona et al.). The advent of multiplexed sequencing means that researchers should consider read depth as a trade-off against cost and replication when designing experiments (Liu et al.), so knowing the relationship between read depth and power is essential to designing sequencing experiments. Similarly, many researchers need to demonstrate that they have adequate depth in an existing experiment to support their biological conclusions.

One valuable approach that multiple studies have used is to randomly subsample reads (sometimes called downsampling) and perform an identical analysis on each subsample. This is in contrast to methods that fit a parametric model to calculate power, such as Scotty (Busby et al.). By determining where metrics of power and accuracy 'saturate' with increasing depth, one can both make recommendations for future experiments and demonstrate whether an existing experiment has sufficient depth. Studies have used random subsampling to propose guidelines for future experiments (Black et al.; Liu et al.), to survey different RNA-Seq analysis methods at varying read depths (Labaj et al.; Liu et al.; Rapaport et al.), or to demonstrate that they had achieved adequate read depth (Daines et al.; Toung et al.; Wang et al.). However, all took the approach of randomly subsampling from either the fastq or alignment file, and then repeating the analysis, including the computationally intensive step of matching reads to genes, on each file.
This process is slow, demanding of disk space, and requires possessing the original reads or mappings, which limits the number of subsamples that can be performed and the ease of performing this analysis on existing experiments.

We introduce the subSeq R package, which instead subsamples sequencing reads with binomial sampling after they have been matched to genes and assembled into a count matrix. Because each read is matched to a gene independently and deterministically, this approach is functionally identical to the common approach of subsampling the read alignment files, but requires only the count matrix rather than the read alignment file. It also takes negligible time and computing resources even on large datasets, as the steps downstream of the read subsampling are much faster than the upstream steps. A similar approach is used to generate saturation figures in the NOISeq package (Tarazona et al.), but subSeq is designed to be used with any RNA-Seq analysis method. subSeq can be applied immediately to any experiment in the ReCount resource of analysis-ready datasets (Frazee et al.), and to any RNA-Seq experiment that provides a matrix of read counts per gene. An early version of this software was used in Robinson et al., on Bar-Seq measurements of the yeast deletion set, to determine the effect of read depth on detection of differential abundance.

subSeq also streamlines the process of performing a differential expression analysis on each subsample, and of calculating relevant biological metrics for each to determine how they vary with read depth. In particular, subSeq reports metrics representing (i) the power to detect differential expression or abundance, (ii) the accuracy of effect size estimation and (iii) the estimated rate of false discoveries relative to the full experiment.

2 METHODS

The user provides an unnormalized $M \times N$ matrix $X$ of read counts, where each row represents one of $M$ genes, each column represents one of $N$ samples and each value denotes the number of reads aligned to each gene within each sample. The user also specifies a vector of $K$ subsampling proportions $p_1, \ldots, p_K$, each in the interval $(0, 1]$, and the number of replications to perform at each proportion. For each $p_k$, a subsampled matrix $Y^{(k)}$ is generated such that $Y^{(k)}_{m,n} \sim \mathrm{Binomial}(X_{m,n}, p_k)$ for $m = 1, \ldots, M$ and $n = 1, \ldots, N$. This is equivalent to allowing each original mapped read to have probability $p_k$ of being included in the new counts, as done, for example, by the Picard DownsampleSam function. For each subsample, we perform the same analysis that is performed on the full set of reads. Multiple approaches for the determination of RNA-Seq differential expression from a matrix of counts, including edgeR (Robinson et al.) and DESeq2 (Love et al.), are built into subSeq, as is DEXSeq for differential exon usage detection (Anders et al.). The user can also provide a custom method to be applied to each subsample.

Here we use subSeq to examine the effect of depth on the RNA-Seq dataset from Hammer et al., testing for differential expression between rats with induced chronic neuropathic pain and a control group. The mapped read counts were downloaded from ReCount, only samples from the 2-month time point were used, and genes with fewer than five mapped reads were filtered out. We subsampled 11 proportions on a logarithmic scale from 0.01 to 1, performing five replications at each proportion.
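The subsampling step can be sketched in a few lines of Python using only the standard library. This is an illustration under stated assumptions, not subSeq's R implementation: the function names `log_spaced_proportions` and `subsample_counts` are hypothetical, the toy matrix is made up, and each binomial draw is simulated as a sum of per-read Bernoulli trials, which is easy to read but slow for large counts.

```python
import math
import random


def log_spaced_proportions(low=0.01, high=1.0, count=11):
    """Return `count` proportions evenly spaced on a log10 scale (hypothetical helper)."""
    lo, hi = math.log10(low), math.log10(high)
    props = [10 ** (lo + (hi - lo) * i / (count - 1)) for i in range(count)]
    # Guard against floating-point overshoot so every value stays within (0, high].
    return [min(high, p) for p in props]


def subsample_counts(counts, p, rng):
    """Thin a genes x samples count matrix: each read is kept with probability p.

    Summing x Bernoulli(p) trials is one way to draw from Binomial(x, p).
    """
    if not 0 < p <= 1:
        raise ValueError("p must lie in the interval (0, 1]")
    return [[sum(rng.random() < p for _ in range(x)) for x in row] for row in counts]


if __name__ == "__main__":
    rng = random.Random(2014)          # fixed seed only for reproducibility
    toy = [[10, 0, 250], [3, 1000, 7]]  # made-up counts: 2 genes x 3 samples
    for p in log_spaced_proportions():
        print(round(p, 4), subsample_counts(toy, p, rng))
```

With the defaults, the helper yields the 11 log-spaced proportions from 0.01 to 1 mentioned in the text; at p = 1 the subsampled matrix equals the input, because every read is kept.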

3 RESULTS

As an illustrative example, we show the results of subsampling an RNA-Seq dataset from Hammer et al., using edgeR or DESeq2 to normalize and test each subsample for differential expression. Performing these subsamples manually would have required downloading 11.4 Gb of reads, mapping them to the rat genome, downsampling to produce an additional 95 Gb of alignments, matching each read to the gene annotations and only then performing the differential expression analysis. Using subSeq, the subsampling requires only the 4.9 Mb matrix from the ReCount database, can be performed entirely in memory in R and takes a negligible amount of time (<1 s to perform the 55 subsamplings, ∼2–8 minutes to perform the analysis at each step, depending on the method chosen).

After constructing subsamples and performing an analysis on each, subSeq calculates and visualizes summary metrics about each sequencing depth (Fig. 1); these plots aid in determining saturation of depth (Supplementary Fig. S1). Because the plots show how read depth changes the conclusions of the analysis, the 'oracle' is defined as the P-values and estimates at the full depth. To estimate the power, subSeq determines the number of genes found significant at a given false discovery rate. To determine whether the decrease in read depth affects specificity, we also estimate the false discovery proportion (FDP) at each depth. subSeq does this by using the qvalue package to estimate the local false discovery rate (FDR) for each gene in the oracle, then calculating the average of the oracle local FDR values among the genes found significant at each depth. To determine how depth affects the accuracy of effect size estimation, subSeq compares the log fold-changes estimated at each depth with the oracle estimates, reporting the mean-squared error and the Pearson and Spearman correlations.

Fig. 1.

The default plot generated by subSeq on subsamples of Hammer et al. This shows the number of significant genes at each depth (top left), the estimated FDP (top right) and the Spearman correlation (bottom left) and mean-squared error (bottom right) comparing the estimates at each depth with the full experiment.

subSeq is designed to allow any analysis to be performed on each subsample. While the example demonstrated here used RNA-Seq data, subSeq works equally well on other genomic approaches such as Bar-Seq or Tn-Seq, as demonstrated in Robinson et al.
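One way to picture the "any analysis on each subsample" design is a driver loop that takes the analysis as a plain function argument. The Python sketch below is an assumption-laden illustration rather than subSeq's interface: `run_at_depths`, `thin` and `total_depth_per_sample` are hypothetical names, and the example analysis simply sums reads per sample in place of a real differential expression method.

```python
import random


def thin(counts, p, rng):
    """Keep each read independently with probability p (binomial thinning)."""
    return [[sum(rng.random() < p for _ in range(x)) for x in row] for row in counts]


def run_at_depths(counts, proportions, replications, analyze, seed=0):
    """Apply a user-supplied `analyze` function to every subsample.

    Returns one record per (proportion, replicate) pair so that downstream
    summaries can be grouped by depth.
    """
    rng = random.Random(seed)
    records = []
    for p in proportions:
        for rep in range(replications):
            sub = thin(counts, p, rng)
            records.append({"proportion": p, "replicate": rep, "result": analyze(sub)})
    return records


def total_depth_per_sample(counts):
    """Toy stand-in for a real analysis: column sums of the count matrix."""
    return [sum(col) for col in zip(*counts)]


if __name__ == "__main__":
    toy = [[10, 0, 250], [3, 1000, 7]]  # made-up counts: 2 genes x 3 samples
    for rec in run_at_depths(toy, [0.1, 0.5, 1.0], 2, total_depth_per_sample):
        print(rec)
```

Because the analysis is passed in as an ordinary callable, swapping in a different method leaves the subsampling loop unchanged.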

REFERENCES

1. mRNA-seq with agnostic splice site discovery for nervous system transcriptomics tested in chronic pain.

Authors: Paul Hammer; Michaela S Banck; Ronny Amberg; Cheng Wang; Gabriele Petznick; Shujun Luo; Irina Khrebtukova; Gary P Schroth; Peter Beyerlein; Andreas S Beutler
Journal: Genome Res Date: 2010-05-07 Impact factor: 9.043

2. RNA-sequence analysis of human B-cells.

Authors: Jonathan M Toung; Michael Morley; Mingyao Li; Vivian G Cheung
Journal: Genome Res Date: 2011-05-02 Impact factor: 9.043

3. Differential expression in RNA-seq: a matter of depth.

Authors: Sonia Tarazona; Fernando García-Alcalde; Joaquín Dopazo; Alberto Ferrer; Ana Conesa
Journal: Genome Res Date: 2011-09-08 Impact factor: 9.043

4. Scotty: a web tool for designing RNA-Seq experiments to measure differential gene expression.

Authors: Michele A Busby; Chip Stewart; Chase A Miller; Krzysztof R Grzeda; Gabor T Marth
Journal: Bioinformatics Date: 2013-01-12 Impact factor: 6.937

5. Evaluating the impact of sequencing depth on transcriptome profiling in human adipose.

Authors: Yichuan Liu; Jane F Ferguson; Chenyi Xue; Ian M Silverman; Brian Gregory; Muredach P Reilly; Mingyao Li
Journal: PLoS One Date: 2013-06-24 Impact factor: 3.240

6. Detecting differential usage of exons from RNA-seq data.

Authors: Simon Anders; Alejandro Reyes; Wolfgang Huber
Journal: Genome Res Date: 2012-06-21 Impact factor: 9.043

7. Characterization and improvement of RNA-Seq precision in quantitative transcript expression profiling.

Authors: Paweł P Łabaj; Germán G Leparc; Bryan E Linggi; Lye Meng Markillie; H Steven Wiley; David P Kreil
Journal: Bioinformatics Date: 2011-07-01 Impact factor: 6.937

8. Evaluation of the coverage and depth of transcriptome by RNA-Seq in chickens.

Authors: Ying Wang; Noushin Ghaffari; Charles D Johnson; Ulisses M Braga-Neto; Hui Wang; Rui Chen; Huaijun Zhou
Journal: BMC Bioinformatics Date: 2011-10-18 Impact factor: 3.169

9. ReCount: a multi-experiment resource of analysis-ready RNA-seq gene count datasets.

Authors: Alyssa C Frazee; Ben Langmead; Jeffrey T Leek
Journal: BMC Bioinformatics Date: 2011-11-16 Impact factor: 3.169

10. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.

Authors: Mark D Robinson; Davis J McCarthy; Gordon K Smyth
Journal: Bioinformatics Date: 2009-11-11 Impact factor: 6.937
