Aleksandr Morgulis, George Coulouris, Yan Raytselis, Thomas L Madden, Richa Agarwala, Alejandro A Schäffer.
Abstract
MOTIVATION: The BLAST software package for sequence comparison speeds up homology search by preprocessing a query sequence into a lookup table. Numerous research studies have suggested that preprocessing the database instead would give better performance. However, production usage of sequence comparison methods that preprocess the database has been limited to programs such as BLAT and SSAHA that are designed to find matches when query and database subsequences are highly similar.
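To make the idea of preprocessing the database concrete, here is a minimal sketch of a k-mer lookup table built over database sequences and then probed with the k-mers of a query. It is an illustration only and not the index data structure described in the paper; the function names, the toy sequences, and the word size `k=4` are assumptions chosen for readability.

```python
from collections import defaultdict


def build_kmer_index(database, k):
    """Map each k-mer to the (sequence id, offset) pairs where it occurs.

    `database` is a dict of sequence id -> sequence string (hypothetical layout).
    """
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for offset in range(len(seq) - k + 1):
            index[seq[offset:offset + k]].append((seq_id, offset))
    return index


def find_seeds(index, query, k):
    """Return (query offset, sequence id, database offset) for each exact k-mer hit."""
    seeds = []
    for q_off in range(len(query) - k + 1):
        # .get avoids inserting empty entries into the defaultdict on a miss
        for seq_id, db_off in index.get(query[q_off:q_off + k], ()):
            seeds.append((q_off, seq_id, db_off))
    return seeds


# Toy data, invented for illustration.
database = {"chrA": "ACGTACGTGA", "chrB": "TTACGTGACC"}
index = build_kmer_index(database, k=4)   # done once, reused for every query
seeds = find_seeds(index, "CGTGA", k=4)   # per-query work is only the lookups
print(seeds)
```

The index is built once and amortized over many queries, which is the trade-off the abstract refers to; a production index would also need a compact on-disk layout, which this sketch ignores.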
Year: 2008 PMID: 18567917 PMCID: PMC2696921 DOI: 10.1093/bioinformatics/btn322
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Schematic of the data structure used for the database index.
Table 1. Database index size for different source databases
| Source database | Database size (Mbp) | Index size (Mbytes) | Bound (Mbytes) |
|---|---|---|---|
| | 116.78 | 229.60 | 250.62 |
| | 116.78 | 204.74 | 250.62 |
| Human chr. 1-5, unmasked | 1025.20 | 1197.79 | 1204.46 |
| Human chr. 6-13, unmasked | 1077.86 | 1254.13 | 1259.75 |
| Human chr. 14-Y, unmasked | 767.77 | 926.56 | 934.16 |
| Human chr. 1-8, masked | 1493.03 | 1229.81 | 1695.68 |
| Human chr. 9-Y, masked | 1377.79 | 1137.62 | 1574.68 |
| Mouse chr. 1-7, unmasked | 1140.61 | 1297.20 | 1325.64 |
| Mouse chr. 8-16, unmasked | 1074.64 | 1222.98 | 1256.37 |
| Mouse chr. 17-Y, unmasked | 428.82 | 538.65 | 578.26 |
| Mouse chr. 1-10, masked | 1526.66 | 1258.19 | 1730.99 |
| Mouse chr. 11-Y, masked | 1117.42 | 932.09 | 1301.29 |
Windowmasker software (Morgulis et al., 2006a) was used to generate masked databases. Windowmasker was run with default parameters and low complexity masking enabled. The bound column shows the upper bound calculated from formula 1.
Table 2. Performance comparison of miBLAST (MI) and indexed MegaBLAST (MB)
| Set | No. of results | Time (s), MI | Time (s), MB | Faster: MI | Faster: MB | Tied |
|---|---|---|---|---|---|---|
| Qsmall | <5000 | 170.20 | 23.21 | 0 | 89 | 0 |
| Qsmall | ≥5000 | 161.11 | 182.72 | 10 | 1 | 0 |
| Qmedium | <5000 | 129.16 | 39.26 | 0 | 17 | 0 |
| Qmedium | ≥5000 | 15340.13 | 4237.12 | 6 | 75 | 2 |
The time represents the sum of median results per query. In this table and in Table 3, the number of results refers to the number of alignments reported, not the number of database sequences with at least one reported alignment. The queries are grouped based on the number of results. The three rightmost columns count the number of queries for which either software package was faster or the running times were considered a tie. The two running times for a query were considered a tie if the time difference was <0.1 s.
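The counting rule in this footnote can be sketched as follows. This is a hedged illustration of the tie threshold only: the function names are hypothetical and the per-query timings are invented, since the page reports only the aggregated counts.

```python
def classify(time_mi, time_mb, tie_threshold=0.1):
    """Label one query as 'MI', 'MB' or 'Tied' from its two running times (seconds)."""
    if abs(time_mi - time_mb) < tie_threshold:
        return "Tied"
    return "MI" if time_mi < time_mb else "MB"


def tally(timings, tie_threshold=0.1):
    """Count, over (MI time, MB time) pairs, how often each package was faster."""
    counts = {"MI": 0, "MB": 0, "Tied": 0}
    for time_mi, time_mb in timings:
        counts[classify(time_mi, time_mb, tie_threshold)] += 1
    return counts


# Invented per-query timings, for illustration only.
example_timings = [(1.70, 0.23), (0.50, 0.55), (2.00, 3.50)]
counts = tally(example_timings)
print(counts)
```

Passing `tie_threshold=0.01` gives the stricter rule used for the wall-clock comparison in Fig. 2.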
Table 3. Running time in seconds for baseline and indexed versions of MegaBLAST in the case of the unmasked human genome database
| Set | Time | <5000 results: Count | <5000 results: Baseline | <5000 results: Indexed | ≥5000 results: Count | ≥5000 results: Baseline | ≥5000 results: Indexed |
|---|---|---|---|---|---|---|---|
| Qlarge | Total | 1 | 5.08 | 3.49 | 61 | 44173.96 | 43888.14 |
| Qlarge | Seed search | | 1.46 | 0.29 | | 961.24 | 1396.93 |
| Qmedium | Total | 13 | 82.17 | 71.95 | 87 | 13009.04 | 13325.62 |
| Qmedium | Seed search | | 24.06 | 6.29 | | 305.10 | 745.59 |
| Qsmall | Total | 88 | 118.05 | 56.51 | 12 | 528.85 | 570.84 |
| Qsmall | Seed search | | 68.19 | 7.24 | | 13.70 | 65.86 |
For each query set the top row contains the number of queries in the corresponding group and the total search time. The second row shows the time taken by the seed search phase only. For 38 queries in Qlarge, at least one version of MegaBLAST ran out of memory due to the large number of results.
Table 4. Running time in seconds for baseline and indexed versions of MegaBLAST in the case of masked genomes and query masked (indicated by yes) or unmasked (indicated by no)
| Set | Time | Human masked: Baseline (yes) | Human masked: Indexed (yes) | Human masked: Indexed (no) | Mouse masked: Baseline (yes) | Mouse masked: Indexed (yes) | Mouse masked: Indexed (no) |
|---|---|---|---|---|---|---|---|
| Qlarge | Total | 1097.11 | 511.76 | 418.31 | 1702.35 | 736.74 | 613.17 |
| Qlarge | Seed search | 447.42 | 108.32 | 127.60 | 447.50 | 98.71 | 114.07 |
| Qmedium | Total | 266.43 | 90.80 | 95.36 | 437.45 | 146.63 | 140.94 |
| Qmedium | Seed search | 176.21 | 23.97 | 28.64 | 162.15 | 22.82 | 25.63 |
| Qsmall | Total | 104.79 | 27.24 | 26.44 | 148.98 | 64.93 | 67.76 |
| Qsmall | Seed search | 76.83 | 5.27 | 5.33 | 70.10 | 5.32 | 5.11 |
| Quser | Total | 144.06 | 46.12 | NA | 164.21 | 82.82 | NA |
| Quser | Seed search | 93.49 | 8.27 | NA | 83.18 | 7.01 | NA |
For each query set the top row is the total search time, and the second row is the time taken for the seed search phase only. NA—not applicable.
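As a reading aid, the speedup implied by these timings can be computed as the ratio of baseline time to indexed time. The sketch below does this for the human masked database with masked queries, using the values reported in the table; the dictionary layout and the helper name `speedup` are assumptions made for this example.

```python
# (baseline, indexed) times in seconds: human masked database, query masked.
human_masked = {
    "Qlarge":  {"total": (1097.11, 511.76), "seed": (447.42, 108.32)},
    "Qmedium": {"total": (266.43, 90.80),   "seed": (176.21, 23.97)},
    "Qsmall":  {"total": (104.79, 27.24),   "seed": (76.83, 5.27)},
    "Quser":   {"total": (144.06, 46.12),   "seed": (93.49, 8.27)},
}


def speedup(baseline, indexed):
    """Ratio of baseline to indexed time; values above 1 mean indexing was faster."""
    return baseline / indexed


speedups = {
    query_set: {phase: speedup(*times) for phase, times in phases.items()}
    for query_set, phases in human_masked.items()
}

for query_set, phases in speedups.items():
    print(f"{query_set}: total x{phases['total']:.2f}, seed search x{phases['seed']:.2f}")
```

In every row the seed search ratio is larger than the total ratio, which is consistent with indexing acting on the seed search phase while the remaining phases of the search are shared by both versions.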
Fig. 2. Wall-clock times for 100 indexed and non-indexed searches in a production setting, as a function of the logarithm of query length. Considering times to be tied if they are within 0.01 s, indexed search is faster 75 times, non-indexed search is faster 19 times, and they tie on 6 queries. Indexed search is faster on shorter queries and slower on the longest queries.