Gregory G Faust, Ira M Hall.
Abstract
MOTIVATION: Illumina DNA sequencing is now the predominant source of raw genomic data, and data volumes are growing rapidly. Bioinformatic analysis pipelines are having trouble keeping pace. A common bottleneck in such pipelines is the requirement to read, write, sort and compress large BAM files multiple times.
Year: 2014 PMID: 24812344 PMCID: PMC4147885 DOI: 10.1093/bioinformatics/btu314
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Custom data structure in SAMBLASTER with a separate set of reference-offset pairs, stored as a hash table, for each combination of sequence1, strand1, sequence2 and strand2. The hash tables are optimized to store 64-bit integers.
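The structure described in the Fig. 1 caption can be sketched roughly as follows. This is a minimal illustration, not SAMBLASTER's actual implementation: the names `pack_positions` and `DuplicateIndex` are hypothetical, a Python `set` stands in for the tool's custom hash table, and packing each pair of reference offsets into one 64-bit integer as two 32-bit halves is an assumption made here to match the caption's mention of 64-bit integers.

```python
from collections import defaultdict

MASK32 = 0xFFFFFFFF  # assumption: each reference offset fits in 32 bits


def pack_positions(pos1: int, pos2: int) -> int:
    """Pack two reference offsets into a single 64-bit integer (hypothetical layout)."""
    return ((pos1 & MASK32) << 32) | (pos2 & MASK32)


class DuplicateIndex:
    """One set of packed offset pairs per (sequence1, strand1, sequence2, strand2)."""

    def __init__(self) -> None:
        # A Python set stands in for the optimized 64-bit hash table.
        self._tables = defaultdict(set)

    def is_duplicate(self, seq1, strand1, pos1, seq2, strand2, pos2) -> bool:
        """Return True if this signature was seen before; otherwise record it."""
        table = self._tables[(seq1, strand1, seq2, strand2)]
        key = pack_positions(pos1, pos2)
        if key in table:
            return True
        table.add(key)
        return False


index = DuplicateIndex()
first = index.is_duplicate("chr1", "+", 100, "chr1", "-", 350)
second = index.is_duplicate("chr1", "+", 100, "chr1", "-", 350)
other_strand = index.is_duplicate("chr1", "-", 100, "chr1", "-", 350)
print(first, second, other_strand)
```

Keying the outer lookup on the sequence and strand combination means each inner table only has to hold plain integers, which is presumably what makes a compact integer-only hash table practical.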
Comparative runtime, memory usage and disk usage statistics for SAMBLASTER 0.1.14, PICARD MarkDuplicates 1.99 and SAMBAMBA markdup 0.4.4 as stand-alone duplicate marking tools, and in a common pipeline that produces a duplicate marked position-sorted BAM file as its final output
| Tool | Mark dups threads | Extra disk (GB) | Total disk IO (G ops) | CPU time (sec) | Wall time (min) | Mem usage (GB) |
|---|---|---|---|---|---|---|
| Stand-alone mark duplicates function | | | | | | |
| SAMBLASTER | 1 | – | 1.863 | 2077 | 43 | ∼15 |
| SAMBAMBA | 1 | – | 2.285 | 6338 | 75 | ∼32 |
| SAMBAMBA | 32 | – | 2.285 | 6603 | 54 | ∼43 |
| PICARD | 32 | – | 3.056 | 63 160 | 302 | ∼30 |
| Mark duplicates–sort–compress pipeline | | | | | | |
| No duplicate marking | – | – | 1.954 | 51 819 | 117 | ∼19 |
| SAMBLASTER | 1 | 0 | 1.987 | 52 767 | 118 | ∼23 |
| SAMBAMBA cmp | 32 | 108 | 2.455 | 86 512 | 154 | ∼43 |
| SAMBAMBA ucmp | 32 | 391 | 3.997 | 61 321 | 163 | ∼43 |
Note: In the pipeline, SAMBAMBA sort and compression are used. There is also a control pipeline run without duplicate marking, which demonstrates that SAMBLASTER adds little overhead. SAMBAMBA markdup times are shown for both an uncompressed (ucmp) and compressed (cmp) position-sorted intermediate file. These tests were run using local RAID storage with fast read/write times. In a more common scenario using networked disk access, SAMBLASTER’s reduced IO results in greater runtime savings versus the other tools.
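The "little overhead" claim in the note can be checked directly against the pipeline rows of the table. The short calculation below is only a worked illustration using the CPU and wall times reported there; the variable and function names are hypothetical.

```python
# CPU seconds and wall-clock minutes copied from the pipeline rows of the table.
control = {"cpu_sec": 51819, "wall_min": 117}      # no duplicate marking
samblaster = {"cpu_sec": 52767, "wall_min": 118}   # SAMBLASTER in the pipeline


def relative_overhead(with_tool: float, baseline: float) -> float:
    """Fractional increase of a measurement over the control run."""
    return (with_tool - baseline) / baseline


cpu_overhead = relative_overhead(samblaster["cpu_sec"], control["cpu_sec"])
wall_overhead = relative_overhead(samblaster["wall_min"], control["wall_min"])

print(f"CPU overhead:  {cpu_overhead:.1%}")
print(f"Wall overhead: {wall_overhead:.1%}")
```

On these figures, adding SAMBLASTER costs a little under 2% more CPU time and under 1% more wall time than the control pipeline.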