Ram Vinay Pandey, Christian Schlötterer.
Abstract
With the rapid and steady increase of next generation sequencing data output, the mapping of short reads has become a major data analysis bottleneck. On a single computer, it can take several days to map the vast quantity of reads produced from a single Illumina HiSeq lane. To ameliorate this bottleneck, we present a new tool, DistMap, a modular, scalable and integrated workflow to map reads in the Hadoop distributed computing framework. DistMap is easy to use, currently supports nine different short read mapping tools and can be run on all Unix-based operating systems. It accepts reads in FASTQ format as input and provides mapped reads in SAM/BAM format. DistMap supports both paired-end and single-end reads, thereby allowing the mapping of read data produced by different sequencing platforms. DistMap is available from http://code.google.com/p/distmap/
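As an illustration of why paired-end FASTQ input suits distributed mapping, the following minimal sketch splits two mate files into fixed-size chunks that keep mates together, so each chunk could be mapped independently. It is not DistMap's implementation, which is not described in this record; the function names `read_fastq` and `paired_chunks` and the chunk size are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

# One FASTQ record: header, sequence, separator, quality string.
Record = Tuple[str, ...]


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield FASTQ records (groups of four lines) from an iterable of lines."""
    it = iter(lines)
    while True:
        rec = tuple(line.rstrip("\n") for line in islice(it, 4))
        if not rec:
            return
        if len(rec) != 4:
            raise ValueError("truncated FASTQ record")
        yield rec


def paired_chunks(
    r1: Iterable[str], r2: Iterable[str], pairs_per_chunk: int
) -> Iterator[List[Tuple[Record, Record]]]:
    """Group mate pairs into chunks of at most `pairs_per_chunk` pairs.

    Hypothetical helper: zip() stops at the shorter input, so a production
    version would also verify that both files hold the same number of reads.
    """
    pairs = zip(read_fastq(r1), read_fastq(r2))
    while True:
        chunk = list(islice(pairs, pairs_per_chunk))
        if not chunk:
            return
        yield chunk


# Tiny in-memory example: three read pairs split into chunks of two.
mate1 = ["@a/1\n", "ACGT\n", "+\n", "IIII\n",
         "@b/1\n", "TTTT\n", "+\n", "IIII\n",
         "@c/1\n", "GGGG\n", "+\n", "IIII\n"]
mate2 = ["@a/2\n", "TGCA\n", "+\n", "IIII\n",
         "@b/2\n", "AAAA\n", "+\n", "IIII\n",
         "@c/2\n", "CCCC\n", "+\n", "IIII\n"]
chunks = list(paired_chunks(mate1, mate2, pairs_per_chunk=2))
```

With real data, the two arguments would be open file handles for the mate files, and each chunk would be handed to one mapping task.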
Year: 2013 PMID: 24009693 PMCID: PMC3751911 DOI: 10.1371/journal.pone.0072614
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. The workflow of DistMap.
The DistMap workflow has 7 modules. Module 1 indexes the reference FASTA file. Modules 2-6 are mandatory for each new set of FASTQ files. The entire DistMap workflow can be executed at once or, if required, each module can be executed one by one.
List of mappers supported in the current version of DistMap.
| Mapper | Version | Application |
|---|---|---|
| BWA | 0.5.8c | Short read mapping |
| bowtie | 0.12.7 | Short read mapping |
| Bowtie2 | 2.0.6 | Short read mapping |
| GSNAP | 2012-07-20 | RNA-Seq alignment, DNA methylation |
| SOAP | 2.20 | Short read mapping |
| STAR | 2.2.0c | RNA-Seq Alignment |
| Bismark | v 0.7.7 | Bisulfite mapping |
| BSMAP | 2.73 | Bisulfite mapping |
| TopHat | 2.0.6 | RNA-Seq mapping |
Nine short read mappers for a wide range of NGS applications are supported. DistMap can use any version of these mappers.
DistMap evaluation input datasets.
| Read pairs (million) | Data size (GB) | Read length (bp) |
|---|---|---|
| 5 | 2.46 | 100 |
| 50 | 24.7 | 100 |
| 100 | 49.4 | 100 |
| 200 | 98.82 | 100 |
| 500 | 247.04 | 100 |
These NGS data were generated on an Illumina HiSeq sequencer as 100 bp paired-end reads from genomes.
Figure 2. Evaluation of DistMap execution time with increasing data size.
Execution times were measured for different mappers using DistMap (red line) and compared to a single node (blue line). Datasets of different sizes (5, 50, 100, 200 and 500 million read pairs) from different Pool-Seq experiments were used to estimate the scalability of DistMap. The single node was a Mac OS X 10.6.8 computer with 10 CPU cores, 32 GB RAM and 1 TB of disk space available for mapping. The Hadoop cluster consisted of 13 worker nodes with the same configuration. (a) BWA mapping, (b) bowtie mapping, (c) GSNAP mapping, (d) SOAP mapping. All mappers were run with the same default parameters and datasets.
Comparison of running time in hours of different mappers on a single node with 10 cores and DistMap running on 13 nodes.
| Read pairs (million) | Mapper | DistMap, 13 nodes (h) | Single node, 10 cores (h) |
|---|---|---|---|
| 5 | BWA | 0.12 | 0.47 |
| 50 | BWA | 0.70 | 4.22 |
| 100 | BWA | 1.40 | 8.33 |
| 200 | BWA | 2.78 | 19.35 |
| 500 | BWA | 6.05 | 82.73* |
| 5 | bowtie | 0.10 | 0.18 |
| 50 | bowtie | 0.30 | 1.77 |
| 100 | bowtie | 0.68 | 3.50 |
| 200 | bowtie | 0.97 | 7.10 |
| 500 | bowtie | 3.43 | 22.38 |
| 5 | GSNAP | 0.1 | 0.63 |
| 50 | GSNAP | 0.57 | 5.62 |
| 100 | GSNAP | 1.07 | 11.12 |
| 200 | GSNAP | 2.17 | 22.13 |
| 500 | GSNAP | 4.97 | 120.75* |
| 5 | SOAP | 0.08 | 0.52 |
| 50 | SOAP | 0.47 | 5.48 |
| 100 | SOAP | 0.87 | 9.2 |
| 200 | SOAP | 1.68 | 19.63 |
| 500 | SOAP | 4.88 | 49.4 |
* Indicates cases where DistMap is faster than separate mapping on 13 individual nodes.
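The asterisked rows can be checked with a short calculation. The sketch below assumes that the third column of the table is the DistMap time on 13 nodes and the fourth is the single-node time, and it reads "separate mapping on 13 individual nodes" as an ideal split, that is, the single-node time divided by 13. Both are interpretations rather than statements from the source, and the names `speedup` and `beats_ideal_split` are illustrative.

```python
# Running times in hours, copied from the table above:
# (read pairs in millions, mapper) -> (DistMap on 13 nodes, single node)
TIMES = {
    (5, "BWA"): (0.12, 0.47),
    (50, "BWA"): (0.70, 4.22),
    (100, "BWA"): (1.40, 8.33),
    (200, "BWA"): (2.78, 19.35),
    (500, "BWA"): (6.05, 82.73),
    (5, "bowtie"): (0.10, 0.18),
    (50, "bowtie"): (0.30, 1.77),
    (100, "bowtie"): (0.68, 3.50),
    (200, "bowtie"): (0.97, 7.10),
    (500, "bowtie"): (3.43, 22.38),
    (5, "GSNAP"): (0.10, 0.63),
    (50, "GSNAP"): (0.57, 5.62),
    (100, "GSNAP"): (1.07, 11.12),
    (200, "GSNAP"): (2.17, 22.13),
    (500, "GSNAP"): (4.97, 120.75),
    (5, "SOAP"): (0.08, 0.52),
    (50, "SOAP"): (0.47, 5.48),
    (100, "SOAP"): (0.87, 9.20),
    (200, "SOAP"): (1.68, 19.63),
    (500, "SOAP"): (4.88, 49.40),
}
NODES = 13  # worker nodes in the Hadoop cluster


def speedup(distmap_h: float, single_h: float) -> float:
    """Ratio of single-node time to DistMap time."""
    return single_h / distmap_h


def beats_ideal_split(distmap_h: float, single_h: float, nodes: int = NODES) -> bool:
    """True when DistMap is faster than an ideal split of the single-node job.

    Assumption: mapping separately on `nodes` machines takes single_h / nodes.
    """
    return distmap_h < single_h / nodes


# Rows where DistMap beats the ideal 13-way split.
flagged = sorted(key for key, (d, s) in TIMES.items() if beats_ideal_split(d, s))

for (pairs, mapper), (d, s) in TIMES.items():
    mark = "*" if (pairs, mapper) in flagged else ""
    print(f"{pairs:>4} M  {mapper:<7} speedup {speedup(d, s):6.2f}x {mark}")
```

Under these assumptions, the only rows that beat the ideal split are the 500 million read pair runs of BWA and GSNAP, which matches the two asterisks in the table.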
Comparison of various features of DistMap and other tools for distributed mapping.
| Feature | | | | | DistMap | |
|---|---|---|---|---|---|---|
| SAM output | no | no | no | no | yes | yes |
| BAM output | no | no | no | no | yes | yes |
| Paired-end Mapping | yes | yes | yes | yes | yes | yes |
| Single-end Mapping | yes | no | no | no | yes | yes |
| Bisulfite Mapping | no | no | no | no | yes | no |
| Installation required | yes | yes | yes | yes | no | yes |
| Dependency | yes | yes | yes | yes | no | yes |
| Operating system | All Unix | Only Linux | Only Linux | All OS | All Unix | All OS |
| Mappers (compiled) | bowtie | BWA | BWA, bowtie, SOAP, GSNAP | GSNAP | BWA, bowtie, Bowtie2, GSNAP, SOAP, STAR, TopHat, TopHat2, Bismark, BSMAP | bowtie, BWA |
| Mapper version dependency | no | yes | yes | yes | no | yes |
| Reference sequence version dependency | no | no | no | no | no | yes |
| Custom queue/pool support | no | no | no | no | yes | no |
| Fair scheduler and Capacity scheduler | no | no | no | no | yes | no |