You Jung Kim, Andrew Boyd, Brian D Athey, Jignesh M Patel.
Abstract
A common task in many modern bioinformatics applications is to match a set of nucleotide query sequences against a large sequence dataset. Existing tools, such as BLAST, are designed to evaluate a single query at a time and can be unacceptably slow when the number of sequences in the query set is large. In this paper, we present a new algorithm, called miBLAST, that evaluates such batch workloads efficiently. At its core, miBLAST employs q-gram filtering and an index join to efficiently detect similarity between the query sequences and database sequences. This set-oriented technique, which indexes both the query and the database sets, results in substantial performance improvements over existing methods. Our results show that miBLAST is significantly faster than BLAST in many cases. For example, miBLAST aligned 247 965 oligonucleotide sequences in the Affymetrix probe set against the Human UniGene in 1.26 days, compared with 27.27 days with BLAST (an improvement by a factor of 22). The relative performance of miBLAST increases for larger word sizes; however, it decreases for longer queries. miBLAST employs the familiar BLAST statistical model and output format, guaranteeing the same accuracy as BLAST and facilitating a seamless transition for existing BLAST users.
Year: 2005 PMID: 16061938 PMCID: PMC1182166 DOI: 10.1093/nar/gki739
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
An example of a q-gram index structure (table entries not preserved in this extraction): the two left columns (Sequence ID, Sequence) represent a sequence dataset and the two right columns (Word (w), Sequence ID) show the index built on the dataset using l = 3.
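The indexing scheme the table illustrates maps each length-l word (q-gram) to the sequences containing it; a minimal Python sketch (the toy dataset and function name are illustrative, not from the paper):

```python
from collections import defaultdict

def build_qgram_index(sequences, l=3):
    """Map every length-l word (q-gram) to the set of IDs of the
    sequences containing it -- the right-hand half of the table."""
    index = defaultdict(set)
    for seq_id, seq in sequences.items():
        for i in range(len(seq) - l + 1):
            index[seq[i:i + l]].add(seq_id)
    return index

# Toy dataset standing in for the unpreserved table entries.
dataset = {1: "ACGTA", 2: "CGTAC", 3: "TTACG"}
index = build_qgram_index(dataset, l=3)
print(sorted(index["CGT"]))  # -> [1, 2]: sequences 1 and 2 share CGT
```

Because the index is keyed by word, looking up all query q-grams touches only the sequences that could possibly match, which is the basis of the filtering step.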
miBLAST (Q, D, l)
(pseudocode body not preserved in this extraction)
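Although the listing itself is lost, the abstract fixes the overall shape: run one set-oriented filtering pass over the whole batch, then apply BLAST-style alignment only to surviving pairs. A hedged top-level sketch under that reading (`filter_pairs` and `blast_align` are hypothetical stand-ins, not the paper's routines):

```python
def miblast(queries, database, filter_pairs, blast_align):
    """Batch flow per the abstract: one set-oriented filtering pass
    over all queries, then alignment of the surviving (query, db)
    pairs only. Both callables are stand-ins for the actual steps."""
    hits = []
    for q_id, d_id in filter_pairs(queries, database):
        alignment = blast_align(queries[q_id], database[d_id])
        if alignment is not None:  # alignment met the score cutoff
            hits.append((q_id, d_id, alignment))
    return hits
```

The point of this structure is that the expensive alignment step runs per candidate pair, not per (query, database sequence) combination.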
INDEX-JOIN FILTER (DIndex, Q, l, m)
(pseudocode body not preserved in this extraction)
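Only the signature of this routine survives. Reading DIndex as a database q-gram index mapping each word to the IDs of sequences containing it, and m as a minimum shared-word count (both assumptions), a join on equal length-l words can be sketched as:

```python
from collections import defaultdict

def index_join_filter(d_index, queries, l, m):
    """Sketch of a q-gram index join: build an index over the query
    batch, join it with the database index on equal length-l words,
    and keep (query, database) pairs sharing at least m words.
    Treating m as a count threshold is an assumption; only the
    signature INDEX-JOIN FILTER(DIndex, Q, l, m) is preserved."""
    q_index = defaultdict(set)
    for q_id, q in queries.items():
        for i in range(len(q) - l + 1):
            q_index[q[i:i + l]].add(q_id)
    counts = defaultdict(int)            # (q_id, d_id) -> shared words
    for word, q_ids in q_index.items():  # the join: equal words
        for d_id in d_index.get(word, ()):
            for q_id in q_ids:
                counts[(q_id, d_id)] += 1
    return {pair for pair, c in counts.items() if c >= m}
```

Indexing the query batch as well as the database is what makes the filter set-oriented: each shared word is discovered once for all queries containing it.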
SLIDING-WINDOW FILTER (DIndex, Q, l, m)
(pseudocode body not preserved in this extraction)
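Again only the signature survives. Figure 2's caption indicates this filter serves query word sizes w larger than the index word size l; a length-w exact word match is then equivalent to a run of w − l + 1 consecutive l-gram hits on one diagonal. A sketch under the assumption that the database index stores (sequence ID, position) postings:

```python
from collections import defaultdict

def sliding_window_filter(d_index, query, l, w):
    """Emulate a BLAST word match of size w >= l with an l-gram index:
    keep database sequences showing w - l + 1 consecutive l-gram hits
    on a single diagonal (database position minus query position).
    The (sequence_id, position) posting layout is an assumption."""
    need = w - l + 1
    hits = defaultdict(list)  # (seq_id, diagonal) -> query offsets
    for i in range(len(query) - l + 1):
        for seq_id, pos in d_index.get(query[i:i + l], ()):
            hits[(seq_id, pos - i)].append(i)  # offsets arrive sorted
    survivors = set()
    for (seq_id, _diag), offsets in hits.items():
        run, prev = 0, None
        for i in offsets:
            run = run + 1 if prev is not None and i == prev + 1 else 1
            prev = i
            if run >= need:
                survivors.add(seq_id)
                break
    return survivors
```

This is why a single index built with l = 11 can serve the larger query word sizes (14 to 23) in Figure 2 without rebuilding the index.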
Figure 1This graph shows the relative speedup of each method compared with naive BLAST, for various workload sizes using a word size of 11. (a) Affymetrix (25 bp). (b) Illumina (70 bp).
The detailed execution time of miBLAST for the experimental results shown in Figure 1
| Workload size | Filtering cost per query (25 bp) | Alignment cost per query (25 bp) | Filtering cost per query (70 bp) | Alignment cost per query (70 bp) |
|---|---|---|---|---|
| 1000 | 0.44 (54%) | 0.37 (46%) | 0.46 (31%) | 1.02 (69%) |
| 2000 | 0.23 (42%) | 0.33 (58%) | 0.25 (21%) | 0.96 (79%) |
| 3000 | 0.17 (36%) | 0.31 (64%) | 0.18 (16%) | 0.94 (84%) |
| 4000 | 0.14 (31%) | 0.30 (69%) | 0.17 (15%) | 0.94 (85%) |
All times reported here are in seconds.
Figure 2This graph shows the effect of the BLAST word size parameter on the query performance for each method, plotted as relative speedup over naive BLAST. The batch size used in this experiment is 4000 queries. miBLAST uses an index word size of 11 and uses a sliding-window filtering method for query word sizes between 14 and 23. (a) Affymetrix (25 bp). (b) Illumina (70 bp).
The effect of the BLAST word size parameter on the filtration ratio and the alignment cost in miBLAST
| Word size | Average filtration ratio | Alignment cost per query |
|---|---|---|
| 11 | 0.54964% | 0.30 |
| 14 | 0.03924% | 0.13 |
| 17 | 0.02231% | 0.11 |
| 20 | 0.02115% | 0.09 |
| 23 | 0.02009% | 0.08 |
The query set used for collecting this data is 4000 queries from the Affymetrix dataset. All times reported here are in seconds.
The average execution time per query for the results shown in Figure 2a, for a batch size of 4000
| Word size | naive BLAST | BLAST-B | miBLAST |
|---|---|---|---|
| 11 | 9.50 | 1.95 | 0.44 |
| 14 | 9.26 | 1.61 | 0.38 |
| 17 | 9.46 | 1.58 | 0.37 |
| 20 | 9.45 | 1.58 | 0.30 |
| 23 | 9.44 | 1.56 | 0.21 |
All times reported here are in seconds.
Figure 3. This graph shows the relative speedup over naive BLAST for various query lengths, using a word size of 11. Queries are drawn from the EST human dataset, and each batch has 1000 queries.
Figure 4This graph shows the execution time of each method for various word sizes using a batch of 1000 queries from the Affymetrix probe set (25 bp).