Nawar Malhis, Yaron S N Butterfield, Martin Ester, Steven J M Jones.
Abstract
MOTIVATION: A plethora of alignment tools have been created, each designed to best fit different alignment conditions. While some of these are made for aligning Illumina Sequence Analyzer reads, none fully utilizes its probability (prb) output. In this article, we introduce a new alignment approach (Slider) that reduces the alignment problem space by utilizing each read base's probabilities given in the prb files.
Year: 2008 PMID: 18974170 PMCID: PMC2638935 DOI: 10.1093/bioinformatics/btn565
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Slider scans both the lexicographically sorted reference database and the lexicographically sorted Px_Reads (s.ol0) input table once to generate all exact matches. Exact matches are stored in the sorted s.m0 table. In this example, the input sequences are 6 bp long (SZr = 6) and are aligned to a reference database of 10 bp (SZd = 10) oligos created with a sliding window across the reference. Reads that match are indicated in bold and underlined; an example of a unique match is indicated by a solid line and an example of a multiple match by a dashed line.
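The single-pass scan described in the Fig. 1 caption can be illustrated with a merge over two sorted lists. The following is a minimal sketch, not Slider's actual implementation: the function names `sliding_oligos` and `exact_matches` are hypothetical, the reads are plain strings rather than prb-derived sequences, and a read is assumed to match an oligo when the oligo begins with the read.

```python
from collections import defaultdict


def sliding_oligos(reference, size):
    """Return (oligo, position) for every full window of `size` bases.

    Assumption of this sketch: trailing windows shorter than `size`
    are dropped, so reads that only fit at the very end are missed.
    """
    return [(reference[i:i + size], i)
            for i in range(len(reference) - size + 1)]


def exact_matches(reads, reference, oligo_size):
    """Merge sorted reads against sorted reference oligos.

    Both lists are sorted lexicographically, so the oligos that start
    with a given read form one contiguous run, and the oligo pointer
    only ever moves forward between reads.
    """
    oligos = sorted(sliding_oligos(reference, oligo_size))
    matches = defaultdict(list)
    i = 0
    for read in sorted(set(reads)):
        # Oligos sorting before this read cannot match it or any later read.
        while i < len(oligos) and oligos[i][0] < read:
            i += 1
        # Collect the contiguous run of oligos that have the read as a prefix.
        j = i
        while j < len(oligos) and oligos[j][0].startswith(read):
            matches[read].append(oligos[j][1])
            j += 1
    # One position means a unique match; several mean a multiple match.
    return {read: sorted(pos) for read, pos in matches.items()}


if __name__ == "__main__":
    hits = exact_matches(["ACGTAC", "GGGGGG", "CGTACG"],
                         "ACGTACGTACGTAA", oligo_size=10)
    for read, positions in hits.items():
        kind = "unique" if len(positions) == 1 else "multiple"
        print(read, positions, kind)
```

In this toy input, one read matches at two reference positions (a multiple match), one matches at a single position (a unique match), and one does not match at all, mirroring the three cases the caption distinguishes.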
Comparison of number of reads aligned at various lengths between Eland, RMAP and Slider
| | CT302, 27 bases | | | CT302, 32 bases | | |
|---|---|---|---|---|---|---|
| | Eland | RMAP | Slider | Eland | RMAP | Slider |
| U0 | 1 421 114 | 1 435 842 | 1 806 896 | 1 073 725 | 1 092 344 | 1 570 854 |
| U1 | 431 065 | 425 641 | 156 084 | 527 388 | 525 338 | 138 566 |
| U2 | 178 993 | 157 701 | | 267 044 | 259 458 | |
| No. of MB | 54 052 593 | 53 776 925 | 52 844 376 | 58 719 548 | 59 549 564 | 54 562 874 |
| No. of BMM | 789 051 | 741 043 | 156 084 | 1 061 476 | 1 044 254 | 138 566 |
| Percentage of BMM | 1.44 | 1.36 | 0.29 | 1.78 | 1.72 | 0.25 |
Using the first 27 and the first 32 bases of the CT302 reads (CT302 is a high-coverage control BAC), the numbers of reads aligned to the reference control BAC are compared for three aligners: Eland with the MPS input, RMAP with Base Quality input and Slider with prb input. The numbers of reads aligned with zero-off (U0), one-off (U1) and two-off (U2) show that Slider aligns a larger number of U0 reads and a smaller number of U1 reads than Eland and RMAP. In the last three rows, we can see that the percentage of base mismatches (BMM) calculated by Slider is more than four times smaller than that of either Eland or RMAP. The total number of reads aligned by Slider also decreases more than that of the other aligners as the read length increases. This is due to the large increase in LQ reads as the read length increases: Slider aligns only the MPS (U0) of LQ reads, while Eland and RMAP treat LQ reads as regular reads and align them with U0, U1 and U2. This special treatment of LQ reads improves Slider's SNP prediction by reducing the percentage of mismatched bases. MB = matched bases.
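The last three rows of the table can be reproduced from the U1, U2 and MB rows. The sketch below relies on a relationship inferred from the numbers rather than stated in the text: BMM = U1 + 2 × U2, and the percentage is BMM divided by the sum of matched and mismatched bases. The function name `bmm_stats` is hypothetical, and the empty Slider U2 cells are treated as zero.

```python
def bmm_stats(u1, u2, matched_bases):
    """Return (BMM, percentage of BMM) for one aligner and read length.

    Assumed relationship (inferred from the table, not from the text):
    each one-off read contributes one mismatched base and each two-off
    read contributes two; the percentage is taken over MB + BMM.
    """
    bmm = u1 + 2 * u2
    pct = 100.0 * bmm / (matched_bases + bmm)
    return bmm, round(pct, 2)


# (U1, U2, MB) taken from the table; Slider reports no U2 reads.
rows = {
    "Eland, 27 bases": (431_065, 178_993, 54_052_593),
    "RMAP, 27 bases": (425_641, 157_701, 53_776_925),
    "Slider, 27 bases": (156_084, 0, 52_844_376),
    "Eland, 32 bases": (527_388, 267_044, 58_719_548),
    "RMAP, 32 bases": (525_338, 259_458, 59_549_564),
    "Slider, 32 bases": (138_566, 0, 54_562_874),
}

if __name__ == "__main__":
    for label, (u1, u2, mb) in rows.items():
        bmm, pct = bmm_stats(u1, u2, mb)
        print(f"{label}: BMM = {bmm}, percentage of BMM = {pct}")
```

Under this assumption, the computed values agree with the "No. of BMM" and "Percentage of BMM" rows for all six columns.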
Fig. 2. Probability that a given base mismatch is a true SNP as a function of the read sequence weight.
Alignment time comparison
| | One lane average alignment time (h:m:s) | Total alignment time (h:m:s) |
|---|---|---|
| Eland | 03:38:28 | 043:41:41 |
| RMAP | 95:00:00+ | 999:59:59+ |
| Slider | 05:06:05 | 028:30:03 |
For each of the aligners Eland, RMAP and Slider, this table shows the average alignment time per lane and the total alignment time (h:m:s) for the complete 12 lanes, which contain 82.6 million reads in total.
Alignment results
| | 27 | | 32 | | 36 | |
|---|---|---|---|---|---|---|
| Eland | 2.791 | 76.65 | 3.002 | 79.47 | | |
| RMAP | 2.828 | 76.69 | 3.002 | 79.45 | 3.520 | 81.68 |
| Slider | 1.169 | 77.08 | 1.172 | 80.19 | 1.302 | 83.16 |
Results of aligning sequences from CT302 to its reference RefBAC and the human genome excluding chromosome 6.