Abstract
BACKGROUND: Read alignment is a computational bottleneck in some sequencing projects. Most of the existing software packages for read alignment are based on two algorithmic approaches: prefix-trees and hash-tables. We propose a new approach to read alignment using random permutations of strings.
Year: 2013 PMID: 23734846 PMCID: PMC3622637 DOI: 10.1186/1471-2105-14-S5-S8
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Sorted library
| Sorted Position | DNA Locus | Original String |
|---|---|---|
| 9383 | 8111 | CTCGATTGGAACTGA |
| 9384 | 17930 | CTCGCAATCCGCAAA |
| 9385 | 3710 | CTCGCAGTGTCAAAC |
| 9386 | 1608 | CTCGCATCAAAGGTT |
| 9387 | 10000 | CTCGCCAAAGCCATG |
| 9388 | 17832 | CTCGCCCACCTATTA |
| 9389 | 1034 | CTCGCCGGTCTAGTC |
| 9390 | 19834 | CTCGCGCGGTCAACT |
| 9391 | 6422 | CTCGCGTCGGGCGAA |
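The table above reads as an excerpt from a lexicographically sorted list of fixed-length substrings of the reference, each tagged with its locus. The following is a minimal sketch of how such a library might be built and queried, not the paper's software: the function names `build_sorted_library` and `lookup_prefix`, the toy reference, and the window length are all illustrative assumptions.

```python
import bisect


def build_sorted_library(reference, k):
    """List every length-k window of the reference with its locus, sorted by string."""
    entries = [(reference[i:i + k], i) for i in range(len(reference) - k + 1)]
    entries.sort()
    return entries


def lookup_prefix(library, read, prefix_len):
    """Return the (string, locus) entries whose first prefix_len characters equal the read's."""
    prefix = read[:prefix_len]
    # The 1-tuple (prefix,) sorts before every (string, locus) pair whose string
    # starts with prefix, so bisect_left lands on the first such entry.
    pos = bisect.bisect_left(library, (prefix,))
    hits = []
    while pos < len(library) and library[pos][0].startswith(prefix):
        hits.append(library[pos])
        pos += 1
    return hits


# Toy reference (hypothetical); a real reference is far longer.
reference = "ACGTACGTTAGCCTAGGATCCA"
library = build_sorted_library(reference, k=5)

# A read that matches the window at locus 7 except for its last character.
read = "TTAGA"
candidates = lookup_prefix(library, read, prefix_len=3)
```

In this sketch a mismatch near the end of the read still leaves the true locus among the candidates, whereas a mismatch inside the prefix would send the lookup to a different part of the sorted list.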
Sorted library of permuted strings
| Sorted Position | DNA Locus | Original String | Permuted String |
|---|---|---|---|
| 8898 | 997 | ATTACGATAACAACG | CTAAAGACAAACTTG |
| 8899 | 11316 | CTGAGCATAGCTACG | CTAAAGAGCTGCGTC |
| 8900 | 4844 | GTTAGGAAAACAACG | CTAAAGAGGAACTAG |
| 8901 | 9523 | GTGCCCAAATCGATG | CTAAAGCCGGTTGAC |
| 8902 | 10000 | CTCGCCAAAGCCATG | CTAAAGGCCCGTCAC |
| 8903 | 4568 | TTTGTAAGATCTACG | CTAAAGGTTTTCTGA |
| 8904 | 16699 | CTCTCCATAGCCAAG | CTAAAGTCCCGACTC |
| 8905 | 9139 | GTGTCTAGAGCTATG | CTAAAGTCGTGTGGT |
| 8906 | 1115 | GTTTGGAGAGCGAGG | CTAAAGTGGGGGTGG |
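In the table above, each permuted string is a reordering of the characters of its original string, and the entry for locus 10000 now sits next to different neighbours than it has in the unpermuted library. The sketch below illustrates one way such a scheme could work, under stated assumptions: a fixed permutation of character positions is applied to every window, the permuted strings are sorted, and a read is permuted the same way before a binary search. All names (`make_permutation`, `build_permuted_library`, `search`), the neighbourhood size `window`, and the number of permutations are hypothetical choices for illustration, not the authors' implementation.

```python
import bisect
import random


def make_permutation(k, seed):
    """Return a reproducible random ordering of the positions 0..k-1."""
    perm = list(range(k))
    random.Random(seed).shuffle(perm)
    return perm


def permute(s, perm):
    """Reorder the characters of s according to perm."""
    return "".join(s[i] for i in perm)


def build_permuted_library(reference, k, perm):
    """Sorted list of (permuted window, locus) for every length-k window."""
    entries = [(permute(reference[i:i + k], perm), i)
               for i in range(len(reference) - k + 1)]
    entries.sort()
    return entries


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def search(reference, read, libraries, window=2, max_mismatches=2):
    """Return (locus, mismatches) of the best candidate found, or None."""
    k = len(read)
    best = None
    for perm, entries in libraries:
        key = permute(read, perm)
        pos = bisect.bisect_left(entries, (key,))
        # Inspect up to `window` entries on each side of the insertion point.
        for j in range(max(0, pos - window), min(len(entries), pos + window)):
            locus = entries[j][1]
            d = hamming(reference[locus:locus + k], read)
            if d <= max_mismatches and (best is None or d < best[1]):
                best = (locus, d)
    return best


# Hypothetical toy setup: random reference, 8 permutations, 15-character windows.
rng = random.Random(0)
reference = "".join(rng.choice("ACGT") for _ in range(2000))
k = 15
libraries = []
for seed in range(8):
    perm = make_permutation(k, seed)
    libraries.append((perm, build_permuted_library(reference, k, perm)))

# A read copied from locus 1000 with one substituted character.
chars = list(reference[1000:1000 + k])
chars[3] = "A" if chars[3] != "A" else "C"
read = "".join(chars)
result = search(reference, read, libraries)  # (locus, mismatches) or None
```

The idea behind using several permutations in this sketch is that a mismatch which falls early in one ordering may fall late in another, where it no longer moves the read far from its true neighbour in the sorted list.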
Figure 1. Single-end alignment of real reads: best match. The percent of reads for which an alignment with up to n mismatches was found. Additional alignments are ignored. Dataset: 10^5 reads from ERR009392_1.
Real single-end reads: search time
| Software | Search time (s) | ||
|---|---|---|---|
| Bowtie -v 3 | 233 | 256 | 773 |
| Bowtie -n 2 | 144 | 334 | 1560 |
| Bowtie -n 2 -k 10 | 658 | 1142 | 2830 |
| Bowtie2 --very-fast | 179 | 285 | 440 |
| Bowtie2 --sensitive | 328 | 654 | 853 |
| Bowtie2 --very-sensitive | 812 | 1488 | 1855 |
| Bowtie2 --very-sensitive -k 10 | 1121 | 2430 | 3869 |
| BWA -o 0 | 548 | 860 | 2434 |
| Permutations-based (mode 1) | 65 | 68 | 111 |
| Permutations-based (mode 2) | 147 | 151 | 145 |
| Permutations-based (fast) | |||
Each dataset contained 10^6 reads from the fastq files obtained from the "1000 Genomes" project.
Figure 2. Single-end alignment of simulated reads: search time and correct alignments. Dataset: 10^6 simulated reads of length 100, mutation rate: 0.1%, indel ratio: 15%, mismatch rate: 2%. The results of additional simulations are reported in Table 4.
Simulated single-end reads: search time and percent of correct alignments
| Read length | 75 | 100 | 150 | |||||||
|---|---|---|---|---|---|---|---|---|---|---|
| Software | Time/%Correct | |||||||||
| Bowtie | time (s) | 141 | 245 | 517 | 177 | 350 | 588 | 306 | 537 | 647 |
| Bowtie | time (s) | 58 | 90 | 165 | 79 | 125 | 190 | 126 | 194 | 218 |
| Bowtie | time (s) | 152 | 236 | 439 | 234 | 393 | 961 | 397 | 733 | 2430 |
| Bowtie | time (s) | 108 | 161 | 286 | 170 | 264 | 609 | 294 | 514 | 1560 |
| Bowtie | time (s) | 873 | 869 | 1179 | 1242 | 1069 | 2377 | 1511 | 1243 | 3446 |
| Bowtie2 | time (s) | 169 | 163 | 136 | 234 | 233 | 196 | 363 | 366 | 299 |
| Bowtie2 | time (s) | 224 | 223 | 178 | 329 | 320 | 267 | 533 | 523 | 422 |
| Bowtie2 | time (s) | 354 | 315 | 275 | 498 | 462 | 396 | 791 | 765 | 638 |
| Bowtie2 | time (s) | 776 | 734 | 611 | 1124 | 1056 | 851 | 1864 | 1730 | 1383 |
| Bowtie2 | time (s) | 1063 | 813 | 691 | 1743 | 1341 | 1165 | 3498 | 2775 | 2395 |
| BWA | time (s) | 320 | 457 | 516 | 404 | 743 | 610 | 715 | 931 | 447 |
| BWA | time (s) | 359 | 560 | 889 | 483 | 1007 | 1206 | 904 | 1513 | 1034 |
| Permutations-based | time (s) | 53 | 76 | 106 | 56 | 90 | 118 | 76 | 98 | 112 |
| Permutations-based | time (s) | 147 | 149 | 152 | 155 | 147 | 145 | 145 | 148 | 156 |
| Permutations-based | time (s) | |||||||||
Each dataset contained 10^6 reads. Mutation rate: 0.1%, indel ratio: 15%.
We report the search time (for BWA: overall run time) and the percent of the reads which were aligned to the correct position in the genome.
In some cases and in some settings, the programs may report several possible alignments for some reads. When needed, additional filtering can be added to aligners to eliminate some of the results, as appropriate for specific applications. In the experiment, the programs reported fewer than 4 alignments per read (on average, in sensitive modes), with 0-1 alignments for the majority of reads. In this table, when a program produces multiple possible alignments, the alignment is considered correct if at least one of the reported alignments corresponds to the correct location.
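The correctness criterion described above can be sketched as a small scoring function. This is an illustrative assumption about how such a percentage might be computed, not the evaluation script used for the tables: the name `percent_correct`, the dictionary layout, and the `tolerance` parameter are all hypothetical.

```python
def percent_correct(reported, truth, tolerance=0):
    """Percent of reads with at least one reported locus at the true position.

    reported:  dict mapping read id -> list of reported loci (may be empty or missing)
    truth:     dict mapping read id -> true locus
    tolerance: allowed distance between a reported locus and the true locus
               (hypothetical parameter; 0 demands an exact match)
    """
    if not truth:
        return 0.0
    hits = 0
    for read_id, true_locus in truth.items():
        candidates = reported.get(read_id, [])
        # One correct candidate among several is enough to count the read.
        if any(abs(c - true_locus) <= tolerance for c in candidates):
            hits += 1
    return 100.0 * hits / len(truth)


# Made-up example: r2 has two candidates, one of them correct;
# r3 is aligned to the wrong place; r4 is not aligned at all.
truth = {"r1": 100, "r2": 200, "r3": 300, "r4": 400}
reported = {"r1": [100], "r2": [5, 200], "r3": [7]}
score = percent_correct(reported, truth)
```

Under this criterion a program that reports more candidates per read can only gain, which is why the number of alignments reported per read is stated alongside the percentages.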
Real paired-end reads: search times.
| Software | Search time (s) | |
|---|---|---|
| Bowtie -v 3 | 2004 | 2145 |
| Bowtie -v 2 | 315 | |
| Bowtie -n 3 | 628 | 718 |
| Bowtie2 --very-fast | 650 | 848 |
| Bowtie2 --fast | 740 | 961 |
| Bowtie2 --sensitive | 978 | 1351 |
| Bowtie2 --very-sensitive | 1749 | 2576 |
| BWA -o 0 | 903 | 2001 |
| BWA -o 1 | 1707 | 3540 |
| Permutations-based (report one) | 345 | |
| Permutations-based (report more) | 608 | 488 |
Each dataset contained 10^6 pairs of reads from the fastq files obtained from the "1000 Genomes" website. Search times are reported.
Figure 3. Paired-end alignment of simulated reads: search time and correct alignments. Dataset: 10^6 pairs of reads of length 100, mutation rate: 0.1%, indel ratio: 15%, mismatch rate: 2%. Additional results are reported in Table 6.
Simulated paired-end reads: search time and percent of correct alignments.
| Low indel probability | High indel probability | ||||||
|---|---|---|---|---|---|---|---|
| Software | Time/%Correct | ||||||
| Bowtie | time (s) | 1766 | 2274 | 3068 | 1849 | 2375 | 3059 |
| Bowtie | time (s) | 6161 | 5283 | 3260 | 5793 | 5151 | 3246 |
| Bowtie | time (s) | 241 | 333 | 353 | 256 | 337 | 352 |
| Bowtie | time (s) | 483 | 593 | 645 | 486 | 618 | 664 |
| Bowtie | time (s) | 1308 | 1293 | 1054 | 1283 | 1199 | 1036 |
| Bowtie2 | time (s) | 757 | 715 | 506 | 734 | 670 | 484 |
| Bowtie2 | time (s) | 834 | 800 | 571 | 804 | 750 | 551 |
| Bowtie2 | time (s) | 1104 | 1079 | 893 | 1062 | 1014 | 857 |
| Bowtie2 | time (s) | 2070 | 2045 | 1671 | 2080 | 1968 | 1648 |
| BWA | time (s) | 1026 | 1404 | 1670 | 1162 | 1379 | 1676 |
| BWA | time (s) | 1252 | 1956 | 3062 | 1499 | 2274 | 3180 |
| Permutations-based | time (s) | ||||||
| Permutations-based | time (s) | 227 | 328 | 530 | 268 | 365 | 553 |
The "low indel probability" datasets were generated with mutation rate: 0.1% and indel ratio: 15%. The "high indel probability" datasets were generated with mutation rate: 0.1% and indel ratio: 100%. Each of the datasets contains 10^6 pairs of 100-character-long reads.
For each dataset and each program, we report the search time (for BWA: overall run time) and the percent of the reads which were aligned to the correct position in the genome.
In some cases and in some settings, the programs may report several possible alignments for some reads. When needed, additional filtering can be added to aligners to eliminate some of the results, as appropriate for specific applications. In the experiment, the programs reported fewer than 3 alignments per read (on average, in sensitive modes), with 0-1 alignments for the majority of reads. In this table, when a program produces multiple possible alignments, the alignment is considered correct if at least one of the reported alignments corresponds to the correct location.