| Literature DB >> 20080505 |
Abstract
MOTIVATION: Many programs for aligning short sequencing reads to a reference genome have been developed in the last 2 years. Most of them are very efficient for short reads but inefficient or not applicable for reads >200 bp because the algorithms are heavily and specifically tuned for short queries with low sequencing error rate. However, some sequencing platforms already produce longer reads and others are expected to become available soon. For longer reads, hashing-based software such as BLAT and SSAHA2 remain the only choices. Nonetheless, these methods are substantially slower than short-read aligners in terms of aligned bases per unit time.Entities:
Mesh:
Year: 2010 PMID: 20080505 PMCID: PMC2828108 DOI: 10.1093/bioinformatics/btp698
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1.Prefix trie and prefix DAWG of string ‘GOOGOL’. (A) Prefix trie. Symbol ‘∧’ marks the start of a string. The two numbers in a node gives the SA interval of the node. (B) Prefix DAWG constructed by collapsing nodes with the identical SA interval. For example, in the prefix trie, three nodes has SA interval [4, 4]. Their parents have interval [1, 2], [1, 2] and [1, 1], respectively. In the prefix DAWG, the [4, 4] node thus has parents [1, 2] and [1, 1]. Node [4, 4] represents three strings ‘OG’, ‘OGO’ and ‘OGOL’ with the first two strings being the prefix of ‘OGOL’. (A) is modified from Figure 1 in Li and Durbin (2009).
Evaluation on simulated data
| Program | Metrics | 100 bp | 200 bp | 500 bp | 1000 bp | 10 000 bp | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2% | 5% | 10% | 2% | 5% | 10% | 2% | 5% | 10% | 2% | 5% | 10% | 2% | 5% | 10% | ||
| BLAT | CPU sec | 685 | 577 | 559 | 819 | 538 | 486 | 1078 | 699 | 512 | 1315 | 862 | 599 | 2628 | 1742 | 710 |
| Q20% | 68.7 | 25.5 | 3.0 | 92.0 | 52.9 | 7.8 | 97.1 | 86.3 | 21.4 | 97.7 | 96.4 | 39.0 | 98.4 | 99.0 | 94.0 | |
| errAln% | 0.99 | 2.48 | 5.47 | 0.55 | 1.72 | 4.55 | 0.17 | 1.12 | 4.41 | 0.01 | 0.52 | 3.98 | 0.00 | 0.00 | 1.28 | |
| BWA-SW | CPU sec | 165 | 125 | 84 | 222 | 168 | 118 | 249 | 172 | 152 | 234 | 168 | 150 | 158 | 134 | 120 |
| Q20% | 85.1 | 62.2 | 19.8 | 93.8 | 88.7 | 49.7 | 96.1 | 95.5 | 85.1 | 96.9 | 96.5 | 95.0 | 98.4 | 98.5 | 98.1 | |
| errAln% | 0.01 | 0.05 | 0.17 | 0.00 | 0.02 | 0.13 | 0.00 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| SSAHA2 | CPU sec | 4872 | 7962 | 9345 | 1932 | 2236 | 5252 | 3311 | 8213 | 6863 | 1554 | 1583 | 3113 | – | – | – |
| Q20% | 85.5 | 83.8 | 78.2 | 93.4 | 93.1 | 91.9 | 96.6 | 96.5 | 96.1 | 97.7 | 97.6 | 97.4 | – | – | – | |
| errAln% | 0.00 | 0.01 | 0.19 | 0.01 | 0.00 | 0.01 | 0.00 | 0.01 | 0.04 | 0.00 | 0.00 | 0.00 | – | – | – | |
Approximately 10 000 000 bp data of different read lengths and error rates are simulated from the human genome. Twenty percent of errors are indel errors with the indel length drawn from a geometric distribution (density: 0.7·0.3). These simulated reads are aligned back to the human genome with BLAT (option -fastMap), BWA-SW and SSAHA2 (option −454 for 100 and 200 bp reads), respectively. The aligned coordinates are then compared with the simulated coordinates to find alignment errors. In each cell in this table, the three numbers are the CPU seconds on a single-core of an Intel E5420 2.5 GHz CPU, percent alignments with mapping quality greater than or equal to 20 (Q20), and percent wrong alignments out of Q20 alignments. SSAHA2 and BWA-SW report mapping quality; BLAT mapping quality is estimated as 250 times the difference of the best and second best alignment scores divided by the best alignment score (essentially the same calculation as the one for BWA-SW).
Summary of alignments inconsistent between the BWA-SW and SSAHA2 on real data
| Condition | Count | BWA-SW | SSAHA2 | ||||
|---|---|---|---|---|---|---|---|
| avgLen | avgDiff | avgMapQ | avgLen | avgDiff | avgMapQ | ||
| BWA-SW≥20; SSAHA2 unmapped | 0 | – | – | – | – | – | – |
| BWA-SW≥20 plausible; SSAHA2<20 | 1188 | 398.2 | 1.3% | 178.4 | 198.3 | 13.1% | 3.9 |
| BWA-SW≥20 questionable | 40 | 183.0 | 7.8% | 41.2 | 280.3 | 9.4% | 2.4 |
| SSAHA2≥20; BWA-SW unmapped | 946 | – | – | – | 75.4 | 6.3% | 51.2 |
| SSAHA2≥20 plausible; BWA-SW<20 | 47 | 129.0 | 9.3% | 2.5 | 200.5 | 8.8% | 34.4 |
| SSAHA2≥20 questionable | 185 | 400.2 | 1.7% | 13.4 | 399.2 | 2.9% | 216.4 |
A total of 137 670 454 reads uniformly selected from SRR003161 were mapped against the human genome with BWA-SW and SSAHA2, respectively. A read is said to be aligned inconsistently if the leftmost coordinates of the BWA-SW and SSAHA2 alignment differs by over 355 bp, the average read length. A score, which equals to the number of matches minus three multiplied by the number of differences (mismatches and gaps) in the aligned region, is calculated for each alignment. A BWA-SW alignment is said to be plausible if the score derived from the BWA-SW alignment minus the one derived from the SSAHA2 alignment of the same read is greater than or equal to 20 (i.e. the BWA-SW alignment is sufficiently better); otherwise the BWA-SW alignment is said to be questionable. Plausible and questionable SSAHA2 alignments are defined in a similar manner. In the table, ‘BWA-SW≥20’ denotes the BWA-SW alignments with mapping quality higher than 20. In all, BWA-SW misses 993 (=946 + 47) alignments which SSAHA2 aligns well, while SSAHA2 misses 1188; 40 BWA-SW Q20 alignments and 185 SSAHA2 Q20 alignments are possibly wrong.