Luca Pireddu, Simone Leo, Gianluigi Zanetti.
Abstract
SUMMARY: SEAL is a scalable tool for short read pair mapping and duplicate removal. It computes mappings that are consistent with those produced by BWA and removes duplicates according to the same criteria employed by Picard MarkDuplicates. On a 16-node Hadoop cluster, it is capable of processing about 13 GB per hour in map+rmdup mode, while reaching a throughput of 19 GB per hour in mapping-only mode. AVAILABILITY: SEAL is available online at http://biodoop-seal.sourceforge.net/.
Year: 2011 PMID: 21697132 PMCID: PMC3137215 DOI: 10.1093/bioinformatics/btr325
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Table 1. SEAL evaluation: input datasets
| Dataset | No. of lanes | No. of pairs | Size (GB) | Read length |
|---|---|---|---|---|
| 5M | 0 | 5.0×10⁶ | 2.3 | 91 |
| DS1 | 1 | 1.2×10⁸ | 51 | 100 |
| DS3 | 3 | 3.3×10⁸ | 147 | 100 |
| DS8 | 8 | 9.2×10⁸ | 406 | 100 |
The 5M dataset consists of the first 5M pairs from run id ERR020229 of the 1000 Genomes Project (Durbin ). The three DS datasets are from a production sequencing run on an Illumina HiSeq 2000.
Table 2. Comparison of running time in hours between BWA on a single node with 8 cores and SEAL running on 32 nodes without duplicate removal
| Dataset | BWA time (h, 1 node) | SEAL time (h, 32 nodes) |
|---|---|---|
| 5M | 0.49 | 0.04 |
| DS1 | 11.26 | 0.63 |
| DS3 | 32.39 | 1.72 |
| DS8 | 89.35 | 4.78 |
Note that the SEAL running time includes qseq to prq format conversion.
ᵃ Time is predicted as a linear extrapolation of the throughput observed on the 5M dataset.
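The linear extrapolation mentioned in the note, and the ratio between the two timing columns, can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the inputs are the rounded values printed in the tables above, so the extrapolated times only approximate the published ones. The ratio compares one 8-core node with 32 nodes, and the SEAL times include format conversion, so it is not a like-for-like per-node comparison.

```python
# Values copied from the tables above. Pair counts are rounded to two
# significant figures there, so results are approximate.
PAIRS = {"5M": 5.0e6, "DS1": 1.2e8, "DS3": 3.3e8, "DS8": 9.2e8}
BWA_HOURS = {"5M": 0.49, "DS1": 11.26, "DS3": 32.39, "DS8": 89.35}
SEAL_HOURS = {"5M": 0.04, "DS1": 0.63, "DS3": 1.72, "DS8": 4.78}


def extrapolate_hours(pairs, ref_pairs=PAIRS["5M"], ref_hours=BWA_HOURS["5M"]):
    """Predict running time assuming the reference throughput stays constant."""
    throughput = ref_pairs / ref_hours  # pairs per hour on the reference run
    return pairs / throughput


def speedup(dataset):
    """Ratio of single-node BWA time to 32-node SEAL time for one dataset."""
    return BWA_HOURS[dataset] / SEAL_HOURS[dataset]


if __name__ == "__main__":
    for name in ("DS1", "DS3", "DS8"):
        predicted = extrapolate_hours(PAIRS[name])
        print(f"{name}: extrapolated {predicted:.2f} h, "
              f"table {BWA_HOURS[name]:.2f} h, ratio {speedup(name):.1f}x")
```

With the table values, the ratio falls between roughly 12 and 19 across the four datasets.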
Fig. 1. Throughput per node of the entire SEAL workflow: finding paired reads in different files, computing the alignment, and removing duplicate reads. An ideal system would produce a flat line, scaling perfectly as the cluster size grows. The three datasets used are described in Table 1. By comparison, a single-node workflow we wrote for testing, which performs the same work as SEAL but uses the standard multithreaded BWA and Picard, reaches a throughput of ~1100 pairs/s on the 5M dataset.
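The per-node throughput plotted in the figure is, in essence, pairs processed per second divided by the number of nodes. The sketch below shows that calculation under stated assumptions: the function names are hypothetical, the figure's own data points are not reproduced here, and the worked call uses the mapping-only DS8 timing from the running-time table rather than the full workflow. The efficiency helper expresses the "flat line" idea: a value of 1.0 at every cluster size would mean perfect scaling.

```python
def per_node_throughput(pairs, hours, nodes):
    """Read pairs processed per second per node."""
    if hours <= 0 or nodes <= 0:
        raise ValueError("hours and nodes must be positive")
    return pairs / (hours * 3600.0 * nodes)


def scaling_efficiency(throughput_by_nodes):
    """Per-node throughput at each cluster size relative to the smallest one.

    A perfectly scaling system yields 1.0 everywhere (a flat line).
    """
    base = throughput_by_nodes[min(throughput_by_nodes)]
    return {n: t / base for n, t in sorted(throughput_by_nodes.items())}


if __name__ == "__main__":
    # DS8, mapping only, 32 nodes (values from the tables above).
    print(f"{per_node_throughput(9.2e8, 4.78, 32):.0f} pairs/s/node")

    # Made-up numbers, used only to exercise the function; not measurements.
    print(scaling_efficiency({16: 100.0, 32: 100.0, 64: 90.0}))
```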