Danai Fimereli, Vincent Detours, Tomasz Konopka.
Abstract
High-throughput sequencing is becoming a popular research tool but carries with it considerable costs in terms of computation time, data storage and bandwidth. Meanwhile, some research applications focusing on individual genes or pathways do not necessitate processing of a full sequencing dataset. Thus, it is desirable to partition a large dataset into smaller, manageable, but relevant pieces. We present a toolkit for partitioning raw sequencing data that includes a method for extracting reads that are likely to map onto pre-defined regions of interest. We show the method can be used to extract information about genes of interest from DNA or RNA sequencing samples in a fraction of the time and disk space required to process and store a full dataset. We report speedup factors between 2.6 and 96, depending on settings and samples used. The software is available at http://www.sourceforge.net/projects/triagetools/.
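The read-extraction idea in the abstract can be illustrated with a minimal seed-matching sketch. It is not the toolkit's implementation. It assumes that a read is kept when at least `H` of its bases are covered by length-`s` substrings (seeds) that also occur in the target region, which is one plausible reading of the seedlength and hits parameters. The function names are hypothetical, and reverse-complement matching and paired reads are left out for brevity.

```python
def build_seed_index(target_seqs, s):
    """Collect every length-s substring (seed) found in the target sequences."""
    seeds = set()
    for seq in target_seqs:
        seq = seq.upper()
        for i in range(len(seq) - s + 1):
            seeds.add(seq[i:i + s])
    return seeds


def count_hits(read, seeds, s):
    """Count read bases covered by at least one seed present in the target.

    Assumption: a "hit" is a covered base; the real tool may define it differently.
    """
    read = read.upper()
    covered = [False] * len(read)
    for i in range(len(read) - s + 1):
        if read[i:i + s] in seeds:
            for j in range(i, i + s):
                covered[j] = True
    return sum(covered)


def is_on_target(read, seeds, s, H):
    """Keep the read if its hit count reaches the threshold H."""
    return count_hits(read, seeds, s) >= H


# Toy example with a made-up 30 bp target and a short seed length.
target = "ACGTACGTTAGCCGATAGGCTTAACCGGTT"
seeds = build_seed_index([target], s=4)
print(is_on_target(target[5:25], seeds, s=4, H=18))          # read taken from the target
print(is_on_target("GGGGGGGGGGGGGGGGGGGG", seeds, s=4, H=18))  # unrelated read
```

Under these assumptions, raising `s` or `H` makes the filter stricter, so fewer off-target reads pass at the cost of losing some on-target reads.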
Year: 2013 PMID: 23408855 PMCID: PMC3627586 DOI: 10.1093/nar/gkt094
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Single-gene classification performance. (a) The true positive rate (TPR) of classification decreases with the hits parameter (H). (b) Proportion of reads that are actually off-target but pass through the classification procedure. As expected, this false positive rate (FPR) decreases with s and H. (c) A ROC-style representation of panels (a) and (b). To make the points distinguishable, the vertical scale shows only a small fraction of the full TPR range and the horizontal scale is logarithmic. The series of points in each seedlength group represent different hits thresholds. (d) Running time of classification and mapping. For reference, the running time of the full alignment was 42 h.
Figure 2. Multiple-gene classification performance. Analogous to Figure 1, but showing results for a target region consisting of 464 cancer genes. In contrast to the case study in Figure 1, the total running time is here dominated by alignment rather than extraction. Thus, it is the larger seedlengths that provide higher speedup [panel (d)].
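The TPR and FPR shown in the figure panels can be computed from sets of read identifiers, as in the sketch below. It assumes the standard definitions (TPR as the fraction of on-target reads that are selected, FPR as the fraction of off-target reads that are selected) and that the true on-target reads are known from a full alignment. The function name and the read identifiers are hypothetical.

```python
def classification_rates(selected, on_target, all_reads):
    """Return (TPR, FPR) for a set of reads selected by a classifier.

    selected  : read IDs that passed the classification step
    on_target : read IDs that truly map to the target (e.g. from a full alignment)
    all_reads : every read ID in the sample
    """
    selected, on_target, all_reads = set(selected), set(on_target), set(all_reads)
    off_target = all_reads - on_target
    true_pos = len(selected & on_target)
    false_pos = len(selected & off_target)
    tpr = true_pos / len(on_target) if on_target else float("nan")
    fpr = false_pos / len(off_target) if off_target else float("nan")
    return tpr, fpr


# Made-up example: 10 reads, 4 truly on-target, 4 selected (3 correct, 1 wrong).
reads = [f"r{i}" for i in range(1, 11)]
tpr, fpr = classification_rates(
    selected=["r1", "r2", "r3", "r9"],
    on_target=["r1", "r2", "r3", "r4"],
    all_reads=reads,
)
print(tpr, fpr)
```

Evaluating such a pair for each combination of s and H gives the points of a ROC-style plot like panel (c).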
Speedup due to targeted alignment
| Sample | NOTCH1, 1 core | NOTCH1, 4 cores | Cancer genes, 1 core | Cancer genes, 4 cores |
|---|---|---|---|---|
| RNA-seq (2 × 50 bp) | 96 | 53 | 6.8 | 6.7 |
| RNA-seq (1 × 75 bp) | 92 | 25 | 4.5 | 4.0 |
| Exome-seq (2 × 100 bp) | 43 | 16 | 3.0 | 3.1 |
| WG-seq (2 × 100 bp) | 66 | 19 | 2.8 | 2.6 |
Speedups were computed by comparing the time required to perform the full alignment with the combined time of triage classification plus mapping of the selected reads. For RNA-seq samples, target regions consisted of exons and 20 bp flanking sequences, s = 14 and H = 46. For exome and WG samples, which have longer reads, target regions consisted of exons and 85 bp flanking regions, s = 14 and H = 85. (Speedups are approximate with error; results may also vary depending on options used, background load, etc.)
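The speedup definition in the table note amounts to a simple ratio, sketched below. The function name is hypothetical. In the example, the 42 h full-alignment time comes from the Figure 1 caption, while the triage and mapping times are made-up placeholders, not measured values.

```python
def speedup(full_alignment_time, triage_time, targeted_mapping_time):
    """Ratio of full-alignment time to (triage + targeted mapping) time.

    All three arguments must use the same time unit.
    """
    combined = triage_time + targeted_mapping_time
    if combined <= 0:
        raise ValueError("combined triage and mapping time must be positive")
    return full_alignment_time / combined


# 42 h is quoted in the Figure 1 caption; 0.25 h + 0.25 h are placeholder values.
print(speedup(full_alignment_time=42.0, triage_time=0.25, targeted_mapping_time=0.25))
```

Because the triage step is part of the denominator, a speedup is only achieved when classification plus targeted mapping together take less time than the full alignment.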