Morgan N Price, Paramvir S Dehal, Adam P Arkin.
Abstract
BACKGROUND: All-versus-all BLAST, which searches for homologous pairs of sequences in a database of proteins, is used to identify potential orthologs, to find new protein families, and to provide rapid access to these homology relationships. As DNA sequencing accelerates and data sets grow, all-versus-all BLAST has become computationally demanding. METHODOLOGY/PRINCIPAL
Year: 2008 PMID: 18974889 PMCID: PMC2571987 DOI: 10.1371/journal.pone.0003589
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Overview of FastBLAST.
FastBLAST reduction of NR.
| Step in Figure 1 | Subset | % Identity Threshold for Clustering | Sequences (millions) | Size (billions of amino acids) | Relative size | Alignments (millions) |
| – | All sequences | None | 6.53 | 2.23 | 100.0% | – |
| 1 | Known families | None | – | 1.72 | 77.2% | 214.7 |
| 5 | Known families | 33% | 2.28 | 0.70 | 31.3% | – |
| 2 | Unassigned regions | None | 2.93 | 0.48 | 21.4% | – |
| – | Unassigned regions | 90% | 2.20 | 0.37 | 16.6% | – |
| 3 | Unassigned regions | 65% | 1.86 | 0.32 | 14.1% | – |
| 4 | Unassigned regions | 40% | 1.49 | 0.25 | 11.3% | – |
| 8 | Ad-hoc families | None | – | 0.65 | 29.2% | 17.4 |
| – | All families | None | – | 2.13 | 95.7% | 232.1 |
Sequence clusters from known families (clustered at 33% identity and merged).
All “unassigned” regions of at least 30 amino acids that do not belong to any of the known families. FastBLAST ignores short linkers between two regions that belong to known families.
Sequence clusters not from known families, clustered with CD-HIT or by analyzing BLAST hits.
Total number of amino acids that belong to any of these families. Because of overlapping hits to families, this is far less than the total length of all the alignments.
Total length of the exemplars of the clusters.
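The "unassigned regions" note above can be illustrated with a minimal sketch. It assumes that family hits are given as 1-based inclusive coordinates on a single protein, and that short linkers are dropped simply by applying the 30-residue minimum to every uncovered stretch. The function name, the coordinate convention, and the example hits are hypothetical and are not taken from the FastBLAST code.

```python
from typing import List, Tuple

MIN_REGION_LEN = 30  # minimum length of an unassigned region, per the note above


def unassigned_regions(
    seq_len: int,
    family_hits: List[Tuple[int, int]],
    min_len: int = MIN_REGION_LEN,
) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) stretches not covered by any family hit.

    Stretches shorter than min_len (for example, short linkers between two
    family hits) are discarded.
    """
    regions = []
    next_free = 1  # first position not yet covered by a hit
    for start, end in sorted(family_hits):
        if start > next_free:
            gap_start, gap_end = next_free, start - 1
            if gap_end - gap_start + 1 >= min_len:
                regions.append((gap_start, gap_end))
        # Hits may overlap, so never move the pointer backwards.
        next_free = max(next_free, end + 1)
    # Trailing stretch after the last hit.
    if seq_len - next_free + 1 >= min_len:
        regions.append((next_free, seq_len))
    return regions


# Hypothetical 400-residue protein with three (partly overlapping) family hits.
hits = [(21, 150), (161, 300), (140, 155)]
print(unassigned_regions(400, hits))  # [(301, 400)]
```

In this toy case the 20-residue N-terminal stretch and the 5-residue linker at positions 156 to 160 fall below the threshold, so only the C-terminal region is kept.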
Figure 2. FastBLAST misses mostly low-ranking hits and/or weak hits.
We show the cumulative proportion of queries that have a miss within the top n hits. Note the log scale on the x axis. The highest proportion is 10.8% because FastBLAST identified all of the top 3,250 homologs at 70 bits or greater for the other 89.2% of queries. We also show results if only higher-scoring hits are considered.
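A curve of this kind could be computed as in the following sketch. It assumes that, for each query, we already know the rank of the first true hit that was missed (or `None` if nothing was missed). The function name, the input representation, and the toy data are illustrative assumptions, not the authors' analysis code or results.

```python
from bisect import bisect_right
from typing import List, Optional, Sequence


def cumulative_miss_proportion(
    first_miss_ranks: Sequence[Optional[int]],
    cutoffs: Sequence[int],
) -> List[float]:
    """For each cutoff n, return the fraction of queries with a miss in their top n hits.

    first_miss_ranks[i] is the rank of the first missed hit for query i,
    or None if no hit was missed for that query.
    """
    total = len(first_miss_ranks)
    if total == 0:
        return [0.0 for _ in cutoffs]
    missed = sorted(r for r in first_miss_ranks if r is not None)
    # bisect_right counts how many first-miss ranks are <= n.
    return [bisect_right(missed, n) / total for n in cutoffs]


# Toy data: 10 queries, 8 with no misses, one first miss at rank 5, one at rank 400.
ranks = [None] * 8 + [5, 400]
print(cumulative_miss_proportion(ranks, [1, 10, 100, 1000]))  # [0.0, 0.1, 0.1, 0.2]
```

The curve is non-decreasing in n and levels off at the fraction of queries with any miss at all, which corresponds to the plateau described in the caption.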