Faraz Hach, Iman Sarrafi, Farhad Hormozdiari, Can Alkan, Evan E Eichler, S Cenk Sahinalp.
Abstract
High-throughput sequencing (HTS) platforms generate unprecedented amounts of data that introduce challenges for processing and downstream analysis. While tools that report the 'best' mapping location of each read provide a fast way to process HTS data, they are not suitable for many types of downstream analysis such as structural variation detection, where it is important to report multiple mapping loci for each read. For this purpose we introduce mrsFAST-Ultra, a fast, cache-oblivious, SNP-aware aligner that can handle the multi-mapping of HTS reads very efficiently. mrsFAST-Ultra improves mrsFAST, our first cache-oblivious read aligner capable of handling multi-mapping reads, through new and compact index structures that reduce not only the overall memory usage but also the number of CPU operations per alignment. In fact, the index generated by mrsFAST-Ultra is 10 times smaller than that of mrsFAST. Equally important, mrsFAST-Ultra introduces new features such as being able to (i) obtain the best mapping loci for each read, and (ii) return all reads that have at most n mapping loci (within an error threshold), together with these loci, for any user-specified n. Furthermore, mrsFAST-Ultra is SNP-aware, i.e. it can map reads to the reference genome while discounting the mismatches that occur at common SNP locations provided by dbSNP; this significantly increases the number of reads that can be mapped to the reference genome. Note that all of the above features are implemented within the index structure rather than as simple post-processing steps, and are therefore performed highly efficiently. Finally, mrsFAST-Ultra utilizes multiple available cores and processors and can be tuned for various memory settings. Our results show that mrsFAST-Ultra is roughly five times faster than its predecessor mrsFAST.
In comparison to newly enhanced popular tools such as Bowtie2, it is more sensitive (it can report 10 times or more mappings per read) and much faster (six times or more) in the multi-mapping mode. Furthermore, mrsFAST-Ultra has an index size of 2GB for the entire human reference genome, which is roughly half of that of Bowtie2. mrsFAST-Ultra is open source and it can be accessed at http://mrsfast.sourceforge.net.
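The SNP-aware idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation (which, per the abstract, lives inside the index structure); it is a toy mismatch counter under stated assumptions: `snp_alleles` is a hypothetical stand-in for a dbSNP lookup, coordinates are 0-based, and base qualities are ignored.

```python
from typing import Dict, Set


def snp_aware_mismatches(read: str, ref_window: str, window_start: int,
                         snp_alleles: Dict[int, Set[str]]) -> int:
    """Count mismatches between a read and an equal-length reference window,
    discounting mismatches at known SNP sites where the read base is one of
    the listed alleles.

    snp_alleles is a hypothetical mapping: reference coordinate -> alleles.
    """
    if len(read) != len(ref_window):
        raise ValueError("read and reference window must have equal length")
    mismatches = 0
    for offset, (read_base, ref_base) in enumerate(zip(read, ref_window)):
        if read_base == ref_base:
            continue
        alleles = snp_alleles.get(window_start + offset, set())
        if read_base in alleles:
            continue  # mismatch explained by a known SNP allele: not counted
        mismatches += 1
    return mismatches


ref = "ACGTACGTAC"
read = "ACCTACGTTC"            # differs from ref at offsets 2 and 8
snps = {102: {"G", "C"}}       # hypothetical SNP at coordinate 100 + 2

plain = snp_aware_mismatches(read, ref, 100, {})    # no SNP information
aware = snp_aware_mismatches(read, ref, 100, snps)  # SNP at offset 2 discounted
print(plain, aware)
```

With an error threshold of 1, this read would be rejected without SNP information and accepted with it, which is the mechanism by which SNP-awareness can increase the number of mapped reads.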
Year: 2014 PMID: 24810850 PMCID: PMC4086126 DOI: 10.1093/nar/gku370
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Reference genome indexing times and index sizes for complete human genome (hg19)
| Software | Indexing time (min) | Index size (GB) |
|---|---|---|
| mrsFAST-Ultra | 8 | 2 |
| mrsFAST | 26 | 20 |
| BWA | 62 | 5.1 |
| Bowtie2 | 107 | 3.8 |
| GEM | 181 | 4.1 |
| RazerS3 (a) | NA | NA |
| GSNAP | 11 | 5.1 |
| SRmapper | 18 | 5.5 |
| Masai (b) | 105 | 15 |
(a) RazerS3 does not need a genome index for performing alignments.
(b) Masai requires 18.7GB of memory for indexing hg19. This could not be executed on our benchmarking machine with a single CPU and 12GB of RAM. Therefore Masai indexing has been performed on a different machine with 256GB of RAM and higher CPU power and I/O speed.
Mapping 2M reads from NA18507 to GRCh37 with e ≤ 6
| Software | Time, 1 thread (min) | Time, 4 threads (min) | No. of mappings (millions) | % of reads mapped |
|---|---|---|---|---|
| mrsFAST-Ultra | 71 | 21 | 308.302 | 90.55 |
| mrsFAST-Ultra (SNP) (a) | 107 | 32 | 341.418 | 92.27 |
| mrsFAST | 362 | NA | 308.302 | 90.55 |
| BWA | 80 | 33 | 268.194 | 90.22 |
| Bowtie2 (b) | 420 | 123 | 33.373 | 90.42 |
| GEM | 15 | 4 | 8.996 | 89.03 |
| RazerS3 | 528 | 234 | 50.653 | 90.55 |
| GSNAP | 184 | 60 | 5.117 | 77.44 |
| SRmapper | 166 | NA | 2.076 | 89.63 |
| mrsFAST-Ultra (c) | 6 | 2 | 21.866 | 11.71 |
| Masai (d) | 33 | NA | 21.829 | 11.70 |
All tools are set to report all mapping locations when possible.
(a) Note that the SNP-aware mrsFAST-Ultra employs dbSNP132 for this task. The base quality at SNP locations is higher than 99% (ASCII value 53), and the base matches either the major or the minor allele.
(b) For Bowtie2, we report the time when it is set to return at most 1000 mappings per read; without this bound it does not complete the task in 12 h.
(c) To be able to compare to Masai, we run mrsFAST-Ultra only on chr1.
(d) Masai crashes during indexing on the full human genome on our benchmarking machine. Results are shown only for mapping the reads to chr1.
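The headline ratios quoted in the abstract can be checked against the single-thread figures in the table above. The sketch below only does that arithmetic; the dictionary names are illustrative, and the Bowtie2 figures are for the run capped at 1000 mappings per read, so the comparison with Bowtie2 is approximate.

```python
# Single-thread running times (min) and mapping counts (millions),
# copied from the multi-mapping table for 2M reads with e <= 6.
time_min = {"mrsFAST-Ultra": 71, "mrsFAST": 362, "Bowtie2": 420}
mappings_millions = {"mrsFAST-Ultra": 308.302, "Bowtie2": 33.373}

speedup_vs_mrsfast = time_min["mrsFAST"] / time_min["mrsFAST-Ultra"]
speedup_vs_bowtie2 = time_min["Bowtie2"] / time_min["mrsFAST-Ultra"]
mapping_ratio_vs_bowtie2 = (
    mappings_millions["mrsFAST-Ultra"] / mappings_millions["Bowtie2"]
)

print(f"speed-up over mrsFAST:  {speedup_vs_mrsfast:.1f}x")
print(f"speed-up over Bowtie2:  {speedup_vs_bowtie2:.1f}x")
print(f"mappings vs Bowtie2:    {mapping_ratio_vs_bowtie2:.1f}x")
```

These come out at roughly 5.1x, 5.9x and 9.2x respectively, in line with the "roughly five times", "six times" and "10 times" figures in the abstract when read as rounded values.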
Running time (in min) for reporting n mapping locations per read
| Software | | | |
|---|---|---|---|
| mrsFAST-Ultra | 58 | 62 | 71 |
| BWA | 69 | 69 | 80 |
| Bowtie2 (a) | 35 | 420 | NA |
| GEM (b) | 14 | 15 | 16 |
| RazerS3 | 382 | 420 | 528 |
| GSNAP | 183 | 184 | 184 |
| SRmapper | 166 | 166 | 166 |
(a) Bowtie2 cannot complete the task in 12 h with the -a option.
(b) Note that although GEM provides the best speed, it has lower sensitivity and a higher memory requirement than mrsFAST-Ultra (4.1GB versus 2.5GB).
Mapping of 2M reads in the best mapping mode, with an error threshold of 2, 4 and 6
| Software | Time (min) | % of reads mapped | Time (min) | % of reads mapped | Time (min) | % of reads mapped |
|---|---|---|---|---|---|---|
| mrsFAST-Ultra | 9 | |||||
| BWA | 11 | 87.52 | 18 | 90.22 | ||
| Bowtie2 | 10 | 10 | 87.52 | 10 | 89.77 | |
| GEM | 6 | 87.18 | 13 | 89.33 | ||
| RazerS3 | 14 | 60 | 326 | |||
| GSNAP | 156 | 71.74 | 180 | 75.81 | 184 | 77.33 |
| SRmapper | 87 | 80.84 | 139 | 86.93 | 166 | 89.63 |
No indels/gaps allowed in any method. We report on both the running time and the percentage of reads mapped. Fastest run times for highest sensitivity values are shown in boldface.
Comparing mrsFAST-Ultra and GSNAP in SNP-tolerant best mapping mode
| Software | Time | % of reads |
|---|---|---|
| mrsFAST-Ultra | 90 min | 92.27 |
| GSNAP | 207 min | 77.63 |
Memory footprint of the tools on 2M reads
| Software | Memory footprint (GB) |
|---|---|
| mrsFAST-Ultra | 2.5 |
| BWA | 3.2 |
| Bowtie2 | 3.2 |
| GEM | 4.1 |
| RazerS3 | 3.1 |
| GSNAP | 4.6 |
| SRmapper | 2.5 |
Figure 1. Average number of locations verified per k-mer extracted from each read, as a function of k. Note that the maximum value of k for the original mrsFAST is 14, even if a user requests a higher value.
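To make the quantity in Figure 1 concrete, the following is a minimal seed-and-verify sketch. It is a generic illustration, not the cache-oblivious index of mrsFAST-Ultra: it assumes a plain hash table of reference k-mers, non-overlapping seeds taken from the read, and Hamming-distance verification. By the pigeonhole argument, a read with at least `max_errors + 1` non-overlapping seeds shares at least one exact seed with every location it matches within `max_errors` mismatches. All function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_kmer_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer of the reference to the positions where it occurs."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read: str, reference: str, index: Dict[str, List[int]],
             k: int, max_errors: int) -> Tuple[List[int], int]:
    """Return (sorted mapping positions, number of candidate verifications)."""
    hits = set()
    verified = 0
    # Non-overlapping seeds at read offsets 0, k, 2k, ...
    for seed_start in range(0, len(read) - k + 1, k):
        seed = read[seed_start:seed_start + k]
        for pos in index.get(seed, []):
            start = pos - seed_start  # implied start of the read in the reference
            if start < 0 or start + len(read) > len(reference):
                continue
            verified += 1  # naive: the same start may be verified more than once
            if hamming(read, reference[start:start + len(read)]) <= max_errors:
                hits.add(start)
    return sorted(hits), verified


reference = "ACGTACGTACGTTTGA"
read = "ACGTTTGA"

result_k2 = map_read(read, reference, build_kmer_index(reference, 2), 2, 1)
result_k4 = map_read(read, reference, build_kmer_index(reference, 4), 4, 1)
print(result_k2, result_k4)
```

On this toy input both seed lengths find the same mapping location, but k = 2 triggers 9 verifications over 4 seeds while k = 4 triggers 4 over 2 seeds, which illustrates why longer seeds reduce the number of locations verified per k-mer.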