Md Pavel Mahmud, John Wiedenhoeft, Alexander Schliep.
Abstract
MOTIVATION: Mapping billions of reads from next-generation sequencing experiments to reference genomes is a crucial task, which can require hundreds of hours of running time on a single CPU even for the fastest known implementations. Traditional approaches have difficulty dealing with matches of large edit distance, particularly in the presence of frequent or large insertions and deletions (indels). This is a serious obstacle both in determining the spectrum and abundance of genetic variations and in personal genomics.
Year: 2012 PMID: 22962448 PMCID: PMC3436807 DOI: 10.1093/bioinformatics/bts380
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Most approaches to approximate string matching using Ukkonen's q-gram lemma rely on the existence of reasonably large q-grams which are exact matches between pattern and text. These can be found efficiently with a number of techniques and yield putative hits, which are then evaluated using an alignment algorithm. For each pattern and each putative hit, the number of shared q-grams is evaluated de novo (left). We map both reads and genome locations to vectors of 2-gram frequencies and identify approximate matches by finding nearest neighbors (right). This search is accelerated by a spatial index structure, e.g. a kd-tree, which is created by recursively partitioning the input space around the median value of a dimension.
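The embedding described in Fig. 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not TreQ's implementation: the window length, the L1 (Manhattan) metric, the toy genome string and all names (`qgram_vector`, `build_kdtree`, `nearest`) are chosen for the example. The source states only that 2-gram frequency vectors are used and that the kd-tree is split at the median of a dimension.

```python
from itertools import product

ALPHABET = "ACGT"
QGRAMS = ["".join(p) for p in product(ALPHABET, repeat=2)]  # the 16 possible 2-grams
INDEX = {g: i for i, g in enumerate(QGRAMS)}


def qgram_vector(seq):
    """Map a DNA string to its 16-dimensional vector of 2-gram counts."""
    vec = [0] * len(QGRAMS)
    for i in range(len(seq) - 1):
        gram = seq[i:i + 2]
        if gram in INDEX:  # skip 2-grams containing e.g. 'N'
            vec[INDEX[gram]] += 1
    return tuple(vec)


def l1(a, b):
    """Manhattan distance between two count vectors (an assumed metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def build_kdtree(points, depth=0):
    """Recursively split (vector, label) pairs around the median of one dimension."""
    if not points:
        return None
    axis = depth % len(points[0][0])
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def nearest(node, target, best=None):
    """Return (distance, label) of the stored vector closest to target."""
    if node is None:
        return best
    vec, label = node["point"]
    dist = l1(vec, target)
    if best is None or dist < best[0]:
        best = (dist, label)
    diff = target[node["axis"]] - vec[node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, target, best)
    if abs(diff) < best[0]:  # the far half-space could still hold a closer point
        best = nearest(far, target, best)
    return best


# Index every window of a toy genome, then look up an error-free read.
genome = "ACGTTGCAAGGCTTACCGATAGCTAGGATCCAGT"
window = 12
tree = build_kdtree([(qgram_vector(genome[i:i + window]), i)
                     for i in range(len(genome) - window + 1)])
read = genome[5:5 + window]
distance, position = nearest(tree, qgram_vector(read))
print(distance, position)  # distance is 0 for an error-free read
```

A nearest neighbor in this space is only a candidate location: different windows can share the same 2-gram profile, so a hit would still need to be verified with an alignment step.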
Fig. 2. Comparison of popular read mappers with TreQ. Accuracy is defined as the percentage of single best reads that are mapped to the exact genomic location they were drawn from in the simulation. Notice that TreQ outperforms most popular read mappers and is mostly on par with LAST.
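The accuracy measure in this caption can be written down as a short function. The sketch below is an assumption-laden illustration: the function name, the dictionary layout, and the choice of all simulated reads as the denominator are not taken from the source.

```python
def mapping_accuracy(true_positions, mapped_positions):
    """Percentage of simulated reads whose best mapping is the exact origin.

    true_positions:   {read_id: position the read was drawn from}
    mapped_positions: {read_id: position of the single best mapping};
                      unmapped reads are simply absent.
    Assumption: the denominator is the number of simulated reads.
    """
    correct = sum(
        1
        for read_id, true_pos in true_positions.items()
        if mapped_positions.get(read_id) == true_pos
    )
    return 100.0 * correct / len(true_positions)


truth = {"r1": 100, "r2": 250, "r3": 975, "r4": 4000}
best_hits = {"r1": 100, "r2": 250, "r3": 980, "r4": 4000}
print(mapping_accuracy(truth, best_hits))  # 75.0
```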
One million randomly selected Illumina single-end reads mapped to HG18 build 36
| Technique | Algorithm | Parameters | Time (h:m) | Mapped % (ED ≤ 3) | Mapped % (ED ≤ 6) | Mapped % (ED ≤ 12) | Mapped % (ED ≤ 18) |
|---|---|---|---|---|---|---|---|
| | Bowtie | | 0:04 | 85.22 | – | – | – |
| | | | 0:04 | 86.85 | – | – | – |
| Suffix trie | BWA | default | 0:14 | 87.31 | 89.35 | – | – |
| | | | 5:53 | 87.35 | 90.08 | 92.39 | 93.03 |
| | SOAP2 | | | 84.87 | – | – | – |
| | mrFAST | | 19:50 | 87.54 | 90.59 | – | – |
| | Novoalign | | 0:27 | 83.68 | 84.80 | 85.18 | 85.19 |
| | SSAHA2 | | 45:36+ | – | – | – | – |
| | RazerS | | 14:45 | 66.67 | 79.41 | – | – |
| | | default | 1:57 | 85.73 | 88.32 | 90.90 | 92.15 |
| Seed-extend | Stampy | | 0:38 | | | | |
| | | default | 1:32 | 84.76 | 87.66 | 90.23 | 90.78 |
| | | | 1:35 | 84.85 | 87.77 | 90.69 | 91.69 |
| | LAST | default, LAMA | 4:36 | 68.74 | 71.12 | 73.26 | 73.72 |
| | | | 4:39 | 39.26 | 40.60 | 41.95 | 42.42 |
| | | | 7:00 | 87.34 | 90.12 | 93.06 | 94.67 |
| | | | 6:28 | 87.27 | 90.06 | 93.01 | 94.61 |
| | | | 6:36 | 87.22 | 90.04 | 93.02 | 94.62 |
| Geometric embedding | TreQ | | 8:15 | 87.32 | 90.11 | 93.04 | 94.66 |
| | | | 7:44 | 87.16 | 89.93 | 92.87 | 94.50 |
| | | | 8:50 | 86.90 | 89.69 | 92.66 | 94.30 |
| Hybrid | SOAP2 + TreQ | | 2:06 | 87.89 | 90.50 | 93.26 | 94.83 |
The percentages of reads mapped within a fixed edit distance (ED) by various read mappers are reported. As expected, trie-based read mappers are very fast but mostly fail to map reads with more errors. BWA with customized parameters performs well but at a significantly increased running time. Seed-extend-based methods have varied outcomes: mrFAST, RazerS and SSAHA2 take significantly more running time than the others; Novoalign is comparably fast but fails to map reads with higher edit distances; LAST (without the LAMA option) and Stampy map nearly as many reads as TreQ. In contrast to most read mappers, TreQ is not restricted to few mismatches, small indels or a small number of indels, and it maps a similar or higher percentage of reads across a range of parameter settings. TreQ's running time is significantly lower than that of mrFAST, RazerS and SSAHA2 and comparable with that of customized BWA; we stopped SSAHA2 after it had not finished within 45 h. Additionally, the hybrid SOAP2 + TreQ approach outperforms most read mappers while significantly reducing the required running time. Note that Bowtie allows only mismatches, and at most three of them. All running times were measured with the read mappers running single-threaded on a single core of a 2.2 GHz AMD Opteron processor.
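The "mapped percentage ≤ ED" columns are cumulative: a read counts towards a threshold if its alignment has at most that many edits. The following sketch shows one way such columns could be computed. The function names and the example distances are hypothetical, and the standard dynamic-programming (Levenshtein) edit distance is assumed here; the source does not state how the distances were obtained.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of substitutions, insertions, deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion from a
                cur[j - 1] + 1,            # insertion into a
                prev[j - 1] + (ca != cb),  # match or substitution
            ))
        prev = cur
    return prev[-1]


def mapped_percentages(distances, total_reads, thresholds=(3, 6, 12, 18)):
    """Cumulative percentage of all reads mapped within each edit-distance threshold.

    distances holds one edit distance per mapped read; unmapped reads are
    counted only in total_reads.
    """
    return {
        t: 100.0 * sum(d <= t for d in distances) / total_reads
        for t in thresholds
    }


print(edit_distance("ACGTTGCA", "ACGTGCA"))       # 1 (one deleted base)
print(mapped_percentages([0, 2, 5, 10, 20], 10))  # {3: 20.0, 6: 30.0, 12: 40.0, 18: 40.0}
```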
Fig. 3. Speed-up achieved by the multi-threaded version of TreQ on the task of mapping 0.1 million randomly selected 101 bp single-end reads from a Yoruba African individual (NA18507).