Abstract
MOTIVATION: With improved short-read assembly algorithms and the recent development of long-read sequencers, split mapping will soon be the preferred method for structural variant (SV) detection. Yet, current alignment tools are not well suited for this.
Year: 2012 PMID: 22829624 PMCID: PMC3463118 DOI: 10.1093/bioinformatics/bts456
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. (A) Starting at each location in the query, we form a k-mer, which is then converted to a hash key by compressing its bases at 2 bits per base. That hash key directly indexes the Hash Array, giving the starting offset and length of the subset of the ROA that contains the reference locations for that k-mer. (B) Next, seed matches from the query and reference that fall along the same diagonal are collected into extended seeds called ‘fragments’ by merging the pre-sorted ROA regions for each query location using a Binary Heap. (C) In any given region of the reference, many fragments can be included in a potential alignment. YAHA uses a graph algorithm to find the set that maximizes the estimated score. In this example, fragments 1, 2 and 4 form the best alignment. (D) During the Optimal Query Coverage algorithm, we find the collection of ‘primary’ alignments (green lines) with the highest non-overlapping sum of scores. Filter By Similarity is then used to determine the remaining ‘secondary’ alignments (blue lines) that are highly similar to any primary alignment. The remaining alignments (red lines) are not included in the output for the query.
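Panels A and B of Fig. 1 can be illustrated with a toy sketch. This is not YAHA's implementation: the function names (`kmer_key`, `build_index`, `seeds_by_diagonal`), the dense `4**k` table, and the example sequences are assumptions made for illustration, and Python's `heapq.merge` stands in for the Binary Heap merge named in the caption.

```python
import heapq
from collections import defaultdict

BASE_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}  # 2 bits per base

def kmer_key(kmer):
    """Pack a k-mer into an integer hash key, 2 bits per base."""
    key = 0
    for base in kmer:
        key = (key << 2) | BASE_BITS[base]  # raises KeyError on non-ACGT bases
    return key

def build_index(reference, k):
    """Return (hash_array, roa); hash_array[key] = (offset, length) into roa."""
    positions = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        positions[kmer_key(reference[pos:pos + k])].append(pos)
    hash_array = [(0, 0)] * (4 ** k)  # dense table, only practical for small k
    roa = []
    for key in sorted(positions):
        hash_array[key] = (len(roa), len(positions[key]))
        roa.extend(positions[key])  # each k-mer's positions are already ascending
    return hash_array, roa

def seeds_by_diagonal(query, k, hash_array, roa):
    """Merge the sorted hit lists of all query positions, grouped by diagonal."""
    streams = []
    for qpos in range(len(query) - k + 1):
        offset, length = hash_array[kmer_key(query[qpos:qpos + k])]
        # For a fixed qpos the ROA slice is sorted, so the diagonals are too.
        streams.append([(rpos - qpos, qpos) for rpos in roa[offset:offset + length]])
    diagonals = defaultdict(list)
    for diag, qpos in heapq.merge(*streams):  # heap-based k-way merge
        diagonals[diag].append(qpos)
    return dict(diagonals)

hash_array, roa = build_index("ACGTACGTTT", k=3)
print(seeds_by_diagonal("CGTA", 3, hash_array, roa))  # {1: [0, 1], 5: [0]}
```

In the toy output, the two seeds on diagonal 1 start at adjacent query positions, so they are the kind of run that would be joined into a single fragment; the lone seed on diagonal 5 would remain a separate, shorter fragment.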
Accuracy comparison of YAHA to BWA-SW over 15 datasets generated in a fashion similar to those in the BWA-SW paper
| | 100 K 100 bp Reads | | | 50 K 200 bp Reads | | | 20 K 500 bp Reads | | | 10 K 1000 bp Reads | | | 1 K 10 000 bp Reads | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Metric | 2% | 5% | 10% | 2% | 5% | 10% | 2% | 5% | 10% | 2% | 5% | 10% | 2% | 5% | 10% | TOTALS |
| BWA-SW | ||||||||||||||||
| CPU secs | 160 | 135 | 102 | 220 | 186 | 140 | 259 | 194 | 154 | 219 | 193 | 142 | 155 | 146 | 129 | 2534 |
| % False negatives | 0.44 | 5.21 | 27.4 | 0.00 | 0.13 | 5.44 | 0.00 | 0.00 | 0.10 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 6.61 |
| % Matching | 96.0 | 89.3 | 64.0 | 98.2 | 97.5 | 89.3 | 98.9 | 98.9 | 98.2 | 99.3 | 99.2 | 99.2 | 99.8 | 99.5 | 98.1 | 89.10 |
| % CS ≥ SSEARCH | 2.96 | 2.92 | 2.69 | 1.74 | 1.71 | 1.66 | 1.09 | 1.02 | 1.13 | 0.68 | 0.74 | 0.66 | 0.20 | 0.50 | 0.90 | 2.21 |
| % False positives | 0.56 | 2.53 | 5.85 | 0.11 | 0.70 | 3.63 | 0.01 | 0.12 | 0.56 | 0.02 | 0.02 | 0.12 | 0.00 | 0.00 | 1.00 | 2.09 |
| % Total error | 1.00 | 7.74 | 33.3 | 0.11 | 0.83 | 9.08 | 0.01 | 0.12 | 0.66 | 0.02 | 0.02 | 0.12 | 0.00 | 0.00 | 1.00 | 8.70 |
| YAHA | ||||||||||||||||
| CPU secs | 284 | 241 | 176 | 212 | 171 | 109 | 245 | 188 | 112 | 108 | 86 | 58 | 81 | 79 | 66 | 2216 |
| % False negatives | 0.32 | 0.12 | 0.55 | 0.03 | 0.00 | 0.02 | 0.00 | 0.01 | 0.00 | 0.01 | 0.01 | 0.07 | 0.00 | 0.00 | 0.00 | 0.19 |
| % Matching | 96.3 | 95.5 | 91.8 | 98.1 | 97.9 | 97.2 | 99.0 | 98.8 | 98.8 | 99.3 | 99.2 | 99.1 | 99.9 | 99.7 | 99.2 | 96.18 |
| % CS ≥ SSEARCH | 2.83 | 3.03 | 3.72 | 1.71 | 1.77 | 1.87 | 1.04 | 1.18 | 1.06 | 0.71 | 0.81 | 0.76 | 0.10 | 0.30 | 0.80 | 2.42 |
| % False positives | 0.55 | 1.31 | 3.97 | 0.16 | 0.34 | 0.93 | 0.02 | 0.02 | 0.11 | 0.00 | 0.01 | 0.03 | 0.00 | 0.00 | 0.00 | 1.21 |
| % Total error | 0.87 | 1.43 | 4.52 | 0.19 | 0.35 | 0.96 | 0.02 | 0.03 | 0.11 | 0.01 | 0.02 | 0.10 | 0.00 | 0.00 | 0.00 | 1.40 |
The column headings give the number and length of reads in each group, as well as the error rate imposed on the reads. Each query is put into one of four categories (rows) depending on the accuracy of its alignment (see text for details). The CPU time in seconds and the total error rate for each run are also shown. The right-most column shows the aggregate runtimes and category percentages.
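The tabulated values are consistent with the total error being the sum of the false-negative and false-positive percentages. That relation is inferred from the numbers rather than stated in the note, so the check below is only a sanity test on a few YAHA columns; the helper name `total_error` is hypothetical.

```python
import math

# (% false negatives, % false positives, reported % total error), copied from
# the YAHA rows of the accuracy table: the 100 bp columns and the TOTALS column.
yaha_columns = {
    "100 bp, 2%": (0.32, 0.55, 0.87),
    "100 bp, 5%": (0.12, 1.31, 1.43),
    "100 bp, 10%": (0.55, 3.97, 4.52),
    "TOTALS": (0.19, 1.21, 1.40),
}

def total_error(false_neg, false_pos):
    """Assumed relation: total error = false negatives + false positives."""
    return false_neg + false_pos

# True when every reported total matches the sum to within rounding.
consistent = all(
    math.isclose(total_error(fn, fp), reported, abs_tol=0.005)
    for fn, fp, reported in yaha_columns.values()
)
print(consistent)  # True
```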
Results of the sensitivity test
| | | | Alignments | | | | Versus MegaBLAST | | |
|---|---|---|---|---|---|---|---|---|---|
| Run | Aligner and parameters | CPU secs | Total | GE50U | A/Sec | GE50U/Sec | >M | =M | <M |
| M | MegaBLAST: wordLen = 15, score = 15 | 190 773 | 4 012 294 854 | 1 827 862 215 | 21 032 | 9581 | 0 | 100 000 | 0 |
| Y1 | YAHA: minMatch = 15, maxHits = 65 525 S | 160 501 | 6 085 988 010 | 2 343 744 189 | 37 919 | 14 603 | 30 638 | 68 357 | 1005 |
| Y2 | YAHA: minMatch = 15, maxHits = 10 000 S | 91 097 | 3 403 790 544 | 1 470 115 221 | 37 364 | 16 138 | 23 789 | 68 387 | 7824 |
| Y3 | YAHA: minMatch = 15, maxHits = 10 000 | 22 385 | 950 852 793 | 327 644 121 | 42 477 | 14 637 | 20 021 | 68 371 | 11 608 |
| Y4 | YAHA: minMatch = 20, maxHits = 650 | 284 | 11 680 597 | 6 796 153 | 41 129 | 23 930 | 716 | 69 536 | 29 748 |
| S1 | SSAHA2: 454 mode | 1850 | 6 066 013 | 5 488 465 | 3279 | 2967 | 834 | 66 634 | 32 532 |
| S2 | SSAHA2: minMatch = 1, maxHits = 10 000 | 937 | 2 633 833 | 1 101 352 | 2811 | 1175 | 120 | 65 622 | 34 258 |
The first two columns give the run name and the aligner with its parameters; column 3 gives the runtime; columns 4–7 contain the total alignments, GE50U alignments, total alignments per second and GE50U alignments per second; and the last three columns show the number of queries with more (>M), the same number of (=M) or fewer (<M) alignments than the MegaBLAST run.
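The per-second columns can be recomputed from the counts and runtimes in the table. The sketch below does this for three runs and also confirms that the three comparison columns add up to the 100 000 queries shown in the MegaBLAST row. The helper name `per_second` and the rounding to the nearest integer are assumptions that happen to reproduce the tabulated values.

```python
# (CPU secs, total alignments, GE50U alignments, >M, =M, <M), copied from the table.
runs = {
    "M": (190_773, 4_012_294_854, 1_827_862_215, 0, 100_000, 0),
    "Y1": (160_501, 6_085_988_010, 2_343_744_189, 30_638, 68_357, 1_005),
    "Y4": (284, 11_680_597, 6_796_153, 716, 69_536, 29_748),
}

def per_second(count, cpu_secs):
    """Alignments per CPU second, rounded to the nearest integer."""
    return round(count / cpu_secs)

for name, (cpu, total, ge50u, more, same, fewer) in runs.items():
    print(name, per_second(total, cpu), per_second(ge50u, cpu), more + same + fewer)
# M 21032 9581 100000
# Y1 37919 14603 100000
# Y4 41129 23930 100000
```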
Fig. 2. Histogram of the number of queries in the Y1 YAHA run with varying numbers of greater, equal and fewer GE50U alignments than MegaBLAST (M). Note the log10 scale bucket sizes. The total number of queries above 0 is 30 638 and below 0 is 1005, as in Table 1.
Fig. 3. Percentage of queries for which each aligner correctly verified an SV breakpoint, for various types of SV events, plotted against the CPU time consumed. Note the large improvement in the Alu dataset when YAHA’s secondary alignments are included. Also note the marked improvement for both BWA-SW and YAHA in the CGR dataset with 4% error rate when the AGS parameters are changed to lower the penalty for indels relative to replacements. Still, YAHA outperforms BWA-SW with both sets of AGS parameters. Graphs C and D are shown with the same axes to ease comparison.