Shu-Qi Zhao, Jun Wang, Li Zhang, Jiong-Tang Li, Xiaocheng Gu, Ge Gao, Liping Wei.
Abstract
BACKGROUND: Next-generation DNA sequencing technologies generate tens of millions of sequencing reads in one run. These technologies are now widely used in biological research, for example in genome-wide identification of polymorphisms, transcription factor binding sites, methylation states, and transcript expression profiles. Mapping the sequencing reads to reference genomes efficiently and effectively is one of the most critical analysis tasks. Although several tools have been developed, their performance suffers when multiple substitutions and insertions/deletions (indels) occur together.
Year: 2009 PMID: 19958483 PMCID: PMC2788372 DOI: 10.1186/1471-2164-10-S3-S2
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Performance comparison based on a real dataset.
| Program | Number of mapped reads | Time (min) | Memory (MB) |
|---|---|---|---|
| BOAT | 4,713,133 | 9,621 | 1,415 |
| SOAP | 4,555,705 | 14,654 | 1,215 |
| RMAP | 4,520,282 | 34,774 | 3,448 |
| SeqMap | 4,339,235 | 18,593 | 20,529 |
| MAQ | 3,879,236 | 1,127 | 2,897 |
A total of 8,755,069 Solexa reads from RNA-seq profiling were mapped to the mouse whole genome with different programs. In this comparison, the maximum mismatch threshold was set to 3 (including substitutions and indels). The comparison was run on a local Linux box with two Intel quad-core (E7310 @ 1.6 GHz) CPUs and 64 GB RAM (detailed running parameters for each tool are shown in Supplementary Table S1 of Additional File 2). Because some of the programs BOAT is compared with are limited by physical memory, reads were mapped against individual chromosomes sequentially. "Time" shows the sum of the execution times, and "Memory" shows the maximal memory usage among those runs.
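The way "Time" and "Memory" are aggregated over the per-chromosome runs can be illustrated with a minimal Python sketch. The function name `summarize_runs` and the tuple layout are assumptions made for illustration; they do not come from the benchmark scripts, and no values from the table are reproduced here.

```python
def summarize_runs(runs):
    """Aggregate per-chromosome runs of one mapping program.

    `runs` is a list of (minutes, peak_memory_mb) tuples, one per chromosome
    (hypothetical layout, chosen for this sketch).
    Returns (total_minutes, max_peak_memory_mb).
    """
    # "Time" is the sum of the execution times of all runs.
    total_time = sum(minutes for minutes, _ in runs)
    # "Memory" is the largest peak memory observed in any single run.
    peak_memory = max(memory for _, memory in runs)
    return total_time, peak_memory
```

Because the runs are sequential, summing the times reflects the total wall-clock cost, while taking the maximum memory reflects the largest footprint the machine ever had to hold at once.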
Performance comparison based on a simulation dataset.
| Program | Number of mapped reads | Recall | Precision | Time (min) | Memory (MB) |
|---|---|---|---|---|---|
| BOAT | 3,833,479 | 76.56% | 99.41% | 18 | 1,217 |
| RMAP | 2,957,658 | 58.89% | 98.90% | 840 | 2,371 |
| SOAP | 2,872,535 | 56.75% | 97.19% | 9 | 186** |
| MAQ | 2,878,570 | 55.93% | 93.53% | 4 | 1,959 |
| SeqMap* | 2,187,611 | 43.57% | 99.25% | 33 | 12,500 |
A total of 5,000,000 simulated reads were mapped to an original two-million-bp mouse chrX region on a local Linux box with two Intel quad-core (E7310 @ 1.6 GHz) CPUs and 64 GB RAM. All programs were tuned to maximize their capability for tolerating no more than five mismatches (detailed running parameters for each tool are shown in Supplementary Table S2 of Additional File 2).
* We tried to run SeqMap with up to 5 mismatches, but it failed with an out-of-memory error. Therefore, only 3 mismatches with 1 indel were allowed when running SeqMap.
** As only a small part of the whole genome was used as the reference sequence in this benchmark, the memory usage of SOAP is very low. However, when mapping to the whole human genome, at least 14 GB of memory is required to run SOAP [5].
Feature comparison of BOAT and other commonly used Solexa read mapping programs
| Program | Maximum number of mismatches allowed | Gapped alignment | Trimming alignment | BLAST-style E-value | Paired-end reads | SNP calling |
|---|---|---|---|---|---|---|
| BOAT | No hardcoded limitation | YES | YES | YES | YES | YES |
| RMAP | No hardcoded limitation | NO | NO | NO | NO | NO |
| MAQ | 3 | NO | NO | NO | YES | YES |
| SOAP | 5 | NO | YES* | NO | YES | NO |
| SeqMap | 5 | YES | NO | NO | NO | NO |
* SOAP provides a similar mode called "iterative alignment", which iteratively trims base pairs at the 3'-end and redoes the alignment until hits are detected or the remaining sequence is too short.
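The iterative trimming described in the note above can be sketched as follows. This is a minimal illustration, not SOAP's implementation: exact substring search stands in for a real aligner, and the function name `iterative_trim_map` and the parameters `min_len` and `step` are assumptions introduced here.

```python
def iterative_trim_map(read, reference, min_len=20, step=1):
    """Trim bases from the 3' end until a hit is found or the read is too short.

    Exact substring search is used as a stand-in for a real aligner
    (an assumption of this sketch).
    Returns (position, aligned_length) for the first hit, or None.
    """
    length = len(read)
    while length >= min_len:
        # Try to place the current (possibly trimmed) 5' portion of the read.
        pos = reference.find(read[:length])
        if pos != -1:
            return pos, length
        # No hit: drop `step` bases from the 3' end and retry.
        length -= step
    return None  # remaining sequence is too short to map
```

A read whose 3' end carries low-quality bases would fail to map at full length but may map after a few trimming rounds; the returned length shows how much of the read was actually aligned.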
Figure 1. Flow chart of the BOAT algorithm. BOAT takes the leading sequence of a read as a seed to initialize an alignment and extends the alignment by traversing through the prefix tree that stores the sequence of the read.
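The seed-and-extend idea in the caption can be illustrated with a simplified Python sketch. It is not BOAT's actual implementation: it handles substitutions only (no indels, trimming, or E-values), requires the seed to match exactly, and the names `build_trie`, `map_reads`, `seed_len`, and `max_mismatches` are assumptions introduced for this example.

```python
def build_trie(reads):
    """Store reads in a prefix tree of nested dicts; '$' marks the end of a read."""
    root = {}
    for read in reads:
        node = root
        for base in read:
            node = node.setdefault(base, {})
        node["$"] = read
    return root


def map_reads(reference, reads, seed_len=4, max_mismatches=1):
    """Return {read: [(start, mismatches), ...]} via seed-and-extend over a trie."""
    trie = build_trie(reads)
    hits = {read: [] for read in reads}

    def extend(node, ref_pos, depth, mismatches, start):
        # A terminal marker means a whole read has been aligned at `start`.
        if "$" in node:
            hits[node["$"]].append((start, mismatches))
        if ref_pos >= len(reference):
            return
        for base, child in node.items():
            if base == "$":
                continue
            if base == reference[ref_pos]:
                extend(child, ref_pos + 1, depth + 1, mismatches, start)
            elif depth >= seed_len and mismatches < max_mismatches:
                # Past the exact-match seed, a substitution may be tolerated.
                extend(child, ref_pos + 1, depth + 1, mismatches + 1, start)

    # Every reference position is a candidate start for a seed.
    for start in range(len(reference)):
        extend(trie, start, 0, 0, start)
    return hits
```

Storing the reads in a prefix tree lets reads that share a leading sequence share the work of matching it: the reference is walked once per start position, and the traversal branches only where the reads diverge or a mismatch is tolerated.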