Enrico Siragusa, David Weese, Knut Reinert.
Abstract
We present Masai, a read mapper representing the state of the art in terms of speed and accuracy. Our tool is an order of magnitude faster than RazerS 3 and mrFAST, and 2-4 times faster and more accurate than Bowtie 2 and BWA. The novelties of our read mapper are filtration with approximate seeds and a method for multiple backtracking. Approximate seeds, compared with exact seeds, increase filtration specificity while preserving sensitivity. Multiple backtracking amortizes the cost of searching a large set of seeds by taking advantage of the repetitiveness of next-generation sequencing data. Combined, these two methods significantly speed up approximate search on genomic data sets. Masai is implemented in C++ using the SeqAn library. The source code is distributed under the BSD license, and binaries for Linux, Mac OS X and Windows can be freely downloaded from http://www.seqan.de/projects/masai.
Year: 2013 PMID: 23358824 PMCID: PMC3627565 DOI: 10.1093/nar/gkt005
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Filtration strategies. A read r occurs in the reference genome g within edit distance 5. (a) If we partition r into six seeds, at least one seed (in white) occurs exactly in g. (b) Alternatively, if we partition r into three seeds, at least one seed (in white) occurs within edit distance 1 in g.
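The two cases in the Figure 1 caption follow from a pigeonhole argument: if at most k errors are spread over s non-overlapping seeds, at least one seed carries no more than floor(k / s) of them. The following is a minimal Python sketch of that arithmetic, not Masai's implementation; the function names `seed_error_bound` and `partition_read` and the example read are hypothetical and chosen only for illustration.

```python
from typing import List


def seed_error_bound(max_errors: int, num_seeds: int) -> int:
    """Pigeonhole bound: with at most `max_errors` errors spread over
    `num_seeds` non-overlapping seeds, at least one seed has no more
    than floor(max_errors / num_seeds) errors."""
    return max_errors // num_seeds


def partition_read(read: str, num_seeds: int) -> List[str]:
    """Split a read into `num_seeds` contiguous, non-overlapping seeds
    whose lengths differ by at most one (hypothetical helper)."""
    base, extra = divmod(len(read), num_seeds)
    seeds, start = [], 0
    for i in range(num_seeds):
        length = base + (1 if i < extra else 0)
        seeds.append(read[start:start + length])
        start += length
    return seeds


# The two strategies from the caption, for a read within edit distance 5.
exact_bound = seed_error_bound(5, 6)    # six seeds: one seed must match exactly
approx_bound = seed_error_bound(5, 3)   # three seeds: one seed within distance 1
example_seeds = partition_read("ACGTACGTACGTACGTAC", 3)  # made-up read
```

Fewer, longer seeds with a small error allowance are what the abstract calls approximate seeds; the sketch only shows the threshold each strategy must search with.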
Figure 2. Multiple backtracking. (a) A part of the suffix trie representing the text GGTAACGGTGCGGGC (Supplementary Figure S1). Numbers on the leaves are suffix positions in the text, whereas letters on the inner nodes are arbitrary and serve to distinguish nodes from each other. (b) The trie representing the set of patterns {GGTT, GTAT, GTGG}, respectively numbered {0, 1, 2}. Labels on the leaves show pattern numbers, whereas labels on the inner nodes are again arbitrary identifiers. (c) Recursive calls performed by Algorithm 2 called with arguments {g, s, 1}. Edges represent comparisons performed by Algorithm 2 at line 10 or by Algorithm 1 at line 6, nodes with curly brackets represent recursive calls, and rectangular leaves represent approximate matches reported. In this example, pattern numbered 0 (GGTT) matches the text twice, at positions 0 and 6, within 1 mismatch. For simplicity, we omitted terminator symbols in the picture.
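The idea in the Figure 2 caption can be sketched as a simultaneous descent of a text trie and a pattern trie with a shared mismatch budget, so that patterns with a common prefix share the work of the search. The Python below is a simplified illustration and not the paper's Algorithm 2: it assumes all patterns have the same length, builds a naive trie of fixed-length substrings instead of a full suffix trie, and counts mismatches only. The names `Node`, `substring_trie`, `pattern_trie` and `multiple_backtracking` are hypothetical.

```python
from typing import Dict, List, Set, Tuple


class Node:
    """Trie node; `labels` holds text positions or pattern numbers."""
    def __init__(self) -> None:
        self.children: Dict[str, "Node"] = {}
        self.labels: List[int] = []


def substring_trie(text: str, depth: int) -> Node:
    """Naive trie of all substrings of length `depth` (a truncated
    stand-in for the suffix trie); leaves store start positions."""
    root = Node()
    for pos in range(len(text) - depth + 1):
        node = root
        for symbol in text[pos:pos + depth]:
            node = node.children.setdefault(symbol, Node())
        node.labels.append(pos)
    return root


def pattern_trie(patterns: List[str]) -> Node:
    """Trie of the patterns; leaves store pattern numbers."""
    root = Node()
    for number, pattern in enumerate(patterns):
        node = root
        for symbol in pattern:
            node = node.children.setdefault(symbol, Node())
        node.labels.append(number)
    return root


def multiple_backtracking(text_node: Node, pat_node: Node,
                          errors_left: int,
                          hits: Set[Tuple[int, int]]) -> None:
    """Descend both tries together, spending one unit of the error
    budget on every mismatching pair of edge symbols."""
    for number in pat_node.labels:        # a pattern ends at this node
        for pos in text_node.labels:
            hits.add((number, pos))
    for p_symbol, p_child in pat_node.children.items():
        for t_symbol, t_child in text_node.children.items():
            cost = 0 if p_symbol == t_symbol else 1
            if cost <= errors_left:
                multiple_backtracking(t_child, p_child,
                                      errors_left - cost, hits)


# Text and patterns taken from the Figure 2 caption; budget of 1 mismatch.
text = "GGTAACGGTGCGGGC"
patterns = ["GGTT", "GTAT", "GTGG"]
hits: Set[Tuple[int, int]] = set()
multiple_backtracking(substring_trie(text, 4), pattern_trie(patterns), 1, hits)
pattern0_positions = sorted(pos for number, pos in hits if number == 0)
```

With these inputs, `pattern0_positions` holds the two positions the caption reports for GGTT, namely 0 and 6. Because GTAT and GTGG share the prefix GT, their first two symbols are compared against each text branch only once, which is the amortization the abstract describes.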
Rabema benchmark results
Rabema scores in percentage (average fraction of edit distance locations reported per read). Large numbers show total scores in each Rabema category, and small numbers show the category scores separately for reads with errors.
Variant detection results
We show the percentage of found origins (recall) and the fraction of unique reads mapped to their origin (precision), classified by reads with s SNPs and i indels.
Runtime results
Results of mapping Illumina reads. Mapped reads: in large, we show the percentage of mapped reads and, in small, the cumulative percentage of reads that were mapped with errors. Rabema any-best: in large, we show the percentage of reads mapped with the minimal number of errors (up to 5%) and, in small, the percentage of reads that were mapped with errors. Remarks: SHRiMP 2 was not able to map the H. sapiens data set within 4 days. Hobbes crashed consistently and could not completely map either the C. elegans or the H. sapiens data set.