| Literature DB >> 20739310 |
Guillaume Rizk1, Dominique Lavenier.
Abstract
MOTIVATION: The rapid development of next-generation sequencing technologies able to produce huge amounts of sequence data is leading to a wide range of new applications. This triggers the need for fast and accurate alignment software. Common techniques often restrict indels in the alignment to improve speed, whereas more flexible aligners are too slow for large-scale applications. Moreover, many current aligners are becoming inefficient as generated reads grow ever larger. Our goal with our new aligner GASSST (Global Alignment Short Sequence Search Tool) is thus 2-fold-achieving high performance with no restrictions on the number of indels with a design that is still effective on long reads.Entities:
Mesh:
Year: 2010 PMID: 20739310 PMCID: PMC2951093 DOI: 10.1093/bioinformatics/btq485
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1.Computation of a semi-global alignment (only the query sequence needs to be globally aligned), in the case of a maximum of 1 error. (A) Dynamic Programming. With 12 nt long sequences and a traditional dynamic programming algorithm, cell calculations can be limited in a band, but there are still 34 cells to compute. (B) Tiled Algorithm. With a precomputed table score of size 4 × 4, the tiled algorithm is performed in only three steps. The first step requires one table access, while the second and third steps both require three table accesses. Here, the score given by the tiled algorithm is the same as the full dynamic programming algorithm—there are two errors in the alignment. In the general case, the tiled algorithm only gives a lower bound of the number of errors present in the alignment.
Evaluation on real data
| Read | Software | Index (s) | Align (s) | Q20% | Q20 |
|---|---|---|---|---|---|
| length | Error rate (%) | ||||
| 36 bp | GASSST | 1712 | 3211 | 34.5 | 0.14 |
| BFAST | 520 800 | 17 520 | 41.8 | 0.12 | |
| BWA | 5158 | 3739 | 35.4 | 0.17 | |
| PASS | 2312 | 5072 | 41.4 | – | |
| 50 bp | GASSST | 1719 | 4090 | 73.7 | 0.04 |
| BFAST | 520 800 | 22 799 | 80.4 | 0.10 | |
| BWA | 5158 | 3043 | 74.5 | 0.17 | |
| PASS | 2144 | 5384 | 79.3 | – | |
| 76 bp | GASSST | 1701 | 8483 | 81.3 | 0.04 |
| BFAST | 520 800 | 161 220 | 85.4 | 0.28 | |
| BWA | 5158 | 3101 | 86.4 | 0.53 | |
| PASS | 1951 | 118 541 | 87.7 | – |
Datasets consisted, respectively, of 11.9, 6.8 and 8.5 millions of reads of size 36, 50 and 76 bp, of accession numbers SRR002320, SRR039633 and SRR017179. The three datasets were aligned with the whole human genome. The time required to compute the genome index is shown in column 3. For BFAST and BWA, the index was computed only once and stored on disk. Column 4 shows the time required in seconds to align reads, running on a single core of a 2.8 GHz Xeon E4562. Column 5 shows the percentage of reads with a mapping quality greater than or equal to 20 (Q20). Last column is the percentage of Q20 alignments that are probably wrong given the results of other aligners: if a program gives a high mapping quality to a read and another program finds a different alignment of similar alignment score for that read, then the first program is probably wrong.
aSince PASS does not compute mapping qualities, fifth column shows for PASS the percentage of reads with a unique best alignment, and error rate is not computed.
Filtering rate on three datasets
| Data Set | Filter Step | Step filter (%) | Total filter (%) |
|---|---|---|---|
| 36 bp | 64.9 | 64.9 | |
| 72.8 | 90.5 | ||
| 96.6 | 99.7 | ||
| 45.9 | 99.8 | ||
| 63.8 | 99.9 | ||
| 50 bp | 41.7 | 41.7 | |
| 80.4 | 88.6 | ||
| 97.5 | 99.7 | ||
| 88.5 | 99.97 | ||
| 67.5 | 99.99 | ||
| 76 bp | 0.1 | 0.1 | |
| 49.8 | 49.9 | ||
| 93.7 | 96.8 | ||
| 94.8 | 99.8 | ||
| 70.2 | 99.95 |
The first column gives the percentage of hits filtered by a given filter. The second column gives the total percentage of hits filtered by the sequence of filters. Reads were aligned with 10% errors maximum, which means a maximum of 3, 4 and 7 errors, gaps or mismatches, respectively.
Filters execution time
| Step | Execution time | Total percent time |
|---|---|---|
| for one hit (ns) | spent in this step | |
| 13.9 | 33 | |
| 5.8 | 14 | |
| 22.4 | 32 | |
| 78.5 | 15 | |
| 356.7 | 3 | |
| 4486.5 | 3 |
Average time taken by each filter step for one hit of a 50 bp query, in nanoseconds. Note that only the times of TNW7(full) and NW actually depends on query size since the others only work on a bounded size subsequence.
Evaluation on simulated data
| Program | Mode | Metrics | 50 bp | 100 bp | 200 bp | 500 bp | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2% | 5% | 10% | 2% | 5% | 10% | 2% | 5% | 10% | 2% | 5% | 10% | |||
| GASSST | fast | Align sec | 584 | 781 | 1720 | 794 | 981 | 1160 | 2030 | 2314 | 3051 | 6573 | 8453 | 11 859 |
| Sensitivity% | 45.8 | 42.8 | 36.1 | 54.2 | 53.3 | 44.9 | 58.6 | 56.3 | 53.8 | 61.0 | 59.8 | 58.8 | ||
| Accuracy% | 99.2 | 98.5 | 93.8 | 90.5 | 89.4 | 86.9 | 91.9 | 91.5 | 89.7 | 93.0 | 92.8 | 91.7 | ||
| GASSST | accurate | Align sec | 1709 | 2741 | 4706 | 1290 | 1887 | 3262 | 3452 | 6173 | 8744 | 12 864 | 18 737 | 34 222 |
| Sensitivity% | 46.7 | 44.6 | 37.5 | 51.0 | 50.3 | 43.5 | 53.4 | 52.8 | 51.7 | 57.5 | 55.7 | 55.5 | ||
| Accuracy% | 99.8 | 99.3 | 93.5 | 99.7 | 99.4 | 97.5 | 99.9 | 99.8 | 99.3 | 99.9 | 99.9 | 99.7 | ||
| BFAST | Align sec | 2279 | 2044 | 1756 | 15 263 | 15 787 | 11 452 | – | – | – | – | – | – | |
| Sensitivity% | 46.5 | 43.0 | 32.1 | 52.7 | 51.0 | 48.5 | – | – | – | – | – | – | ||
| Accuracy% | 98.8 | 96.1 | 85.2 | 99.0 | 98.7 | 95.8 | – | – | – | – | – | – | ||
| BWA | Align sec | 792 | 1392 | 1572 | 1862 | 4941 | 3364 | 4660 | 2145 | 185 | – | – | – | |
| Sensitivity% | 48.2 | 38.6 | 16.8 | 54.8 | 41.0 | 7.3 | 53.3 | 11.9 | 0.1 | – | – | – | ||
| Accuracy% | 99.2 | 97.4 | 93.5 | 99.7 | 99.0 | 97.9 | 99.8 | 99.6 | 96.7 | – | – | – | ||
| BWA-SW | Align sec | – | – | – | – | – | – | 4699 | 3546 | 2365 | 13 027 | 9646 | 7835 | |
| Sensitivity% | – | – | – | – | – | – | 54.9 | 50.3 | 25.2 | 57.3 | 56.1 | 45.4 | ||
| Accuracy% | – | – | – | – | – | – | 99.4 | 96.9 | 85.7 | 99.2 | 96.8 | 85.2 | ||
| SSAHA2 | Align sec | – | – | – | 27 740 | 41 978 | 45 295 | 22 285 | 27 504 | 65 420 | 179 095 | 415 252 | 275 622 | |
| Sensitivity% | – | – | – | 45.3 | 43.5 | 38.6 | 53.2 | 51.4 | 48.4 | 59.5 | 58.8 | 55.8 | ||
| Accuracy% | – | – | – | 99.8 | 99.1 | 95.3 | 99.8 | 99.1 | 96.0 | 99.8 | 99.2 | 95.3 | ||
| PASS | Align sec | 2012 | 2281 | 5085 | 14 387 | 26 033 | 30 022 | 103 338 | 139 436 | 180 943 | – | – | – | |
| Sensitivity% | 50.0 | 43.8 | 31.3 | 51.6 | 37.5 | 16.4 | 49.3 | 16.6 | 2.8 | – | – | – | ||
| Accuracy% | 96.6 | 93.2 | 82.4 | 98.5 | 94.0 | 86.8 | 97.2 | 93.6 | 92.4 | – | – | – | ||
Data sets containing one million reads each are simulated from the human genome with different lengths and error rate. Twenty percent of errors are indel errors with indel length l drawn from a geometric distribution of density 0.7 · 0.3. The alignment time in seconds only includes the fraction of the total time proportional to the number of reads, i.e, not the time spent in computing or loading the index of the human genome, running on a single core of a 2.8 GHz Xeon E4562 . Computed alignment coordinates are compared to the true simulated coordinates to find sensitivity and accuracy. Sensitivity is the percentage of reads correctly mapped, while accuracy is the percentage of mapped reads that are correctly mapped. A read is considered mapped if it has a unique best alignment. A mapped read is considered correct if it is within 10 bp of its true coordinates. We filtered alignments having a mapping quality less than 20, except for PASS, which does not give mapping quality.