Yongan Zhao, Haixu Tang, Yuzhen Ye.
Abstract
SUMMARY: With the wide application of next-generation sequencing (NGS) techniques, fast tools for protein similarity search that scale well to large query datasets and large databases are highly desirable. In a previous work, we developed RAPSearch, an algorithm that achieved a ~20-90-fold speedup relative to BLAST while still achieving similar levels of sensitivity for short protein fragments derived from NGS data. RAPSearch, however, requires a substantial memory footprint to identify alignment seeds, due to its use of a suffix array data structure. Here we present RAPSearch2, a new memory-efficient implementation of the RAPSearch algorithm that uses a collision-free hash table to index a similarity search database. The optimized data structure speeds up the similarity search by a further 2-3 times. We also implemented multi-threading in RAPSearch2, and the multi-thread modes achieve significant acceleration (e.g. 3.5X for 4-thread mode). RAPSearch2 requires up to 2 GB of memory when running in single-thread mode, or up to 3.5 GB of memory when running in 4-thread mode.
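The abstract does not describe the layout of the hash table, so the following is only a minimal sketch of one way a seed index can be made collision-free: each k-mer over a reduced amino acid alphabet is encoded as a unique integer and used as a direct address into a table. The alphabet grouping `GROUPS`, the seed length `K` and the function names are all assumptions made for illustration, not the RAPSearch2 implementation.

```python
# Hypothetical reduced alphabet: the 20 amino acids grouped into 10 classes.
# This grouping is illustrative only and is not claimed to be the one used by RAPSearch2.
GROUPS = ["A", "KR", "EDNQ", "C", "G", "H", "ILVM", "FYW", "P", "ST"]
CODE = {aa: i for i, group in enumerate(GROUPS) for aa in group}
BASE = len(GROUPS)
K = 4  # seed length; an assumed value chosen to keep the sketch small


def kmer_key(kmer):
    """Encode a k-mer as a base-BASE integer over the reduced alphabet.

    Distinct reduced k-mers always get distinct keys, so a table with
    BASE ** K slots needs no collision handling.
    """
    key = 0
    for aa in kmer:
        key = key * BASE + CODE[aa]
    return key


def build_index(db):
    """Store, for every k-mer key, the (sequence id, offset) pairs where it occurs."""
    table = [[] for _ in range(BASE ** K)]  # direct addressing: one slot per key
    for seq_id, seq in enumerate(db):
        for pos in range(len(seq) - K + 1):
            kmer = seq[pos:pos + K]
            if all(aa in CODE for aa in kmer):  # skip ambiguous residues such as X
                table[kmer_key(kmer)].append((seq_id, pos))
    return table


def find_seeds(table, query):
    """Return (query offset, sequence id, database offset) for every shared reduced k-mer."""
    seeds = []
    for qpos in range(len(query) - K + 1):
        kmer = query[qpos:qpos + K]
        if all(aa in CODE for aa in kmer):
            for seq_id, dpos in table[kmer_key(kmer)]:
                seeds.append((qpos, seq_id, dpos))
    return seeds


if __name__ == "__main__":
    database = ["MKVLAAGIT", "AAGRILMK"]
    index = build_index(database)
    print(find_seeds(index, "RILAAG"))
```

In this toy example the query `RILAAG` shares three consecutive reduced 4-mers with the first database sequence, because K/R and I/L/V/M fall into the same classes. A lookup is a single array access per query position, which is the property that makes a table of this kind attractive compared with searching a suffix array.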
Year: 2011 PMID: 22039206 PMCID: PMC3244761 DOI: 10.1093/bioinformatics/btr595
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
The comparison of the running time of BLAST, RAPSearch and RAPSearch2
| Database | Query dataset | Number of reads | Read length (nt) | BLAST (min) | RAPSearch (min) | RAPSearch2, 1 thread (min) | RAPSearch2, 4 threads (min) | RAPSearch2, 8 threads (min) |
|---|---|---|---|---|---|---|---|---|
| IMG (1.6G) | SRR020796 | 1 164 805 | 72 | 95 800 | 1170 | 587 | 170 | 100 |
| IMG (1.6G) | 4440037 | 188 445 | 100 | 9240 | 378 | 120 | 36 | 22 |
| IMG (1.6G) | TS28 | 622 554 | 200 | 67 000 | 3872 | 1341 | 331 | 242 |
| IMG (1.6G) | TS50 | 312 665 | 329 | 39 200 | 4105 | 1512 | 385 | 281 |
| NCBI NR (3.2G) | SRR020796 | 1 164 805 | 72 | 271 000 | 2910 | 1229 | 362 | 250 |
| NCBI NR (3.2G) | 4440037 | 188 445 | 100 | 25 680 | 889 | 335 | 110 | 58 |
| NCBI NR (3.2G) | TS28 | 622 554 | 200 | 177 900 | 8471 | 3019 | 859 | 518 |
| NCBI NR (3.2G) | TS50 | 312 665 | 329 | 103 600 | 9195 | 3545 | 901 | 644 |
a The running time was estimated using 1% of the original query dataset; the actual BLAST search of the original datasets was carried out on a computer cluster. Note that we compared RAPSearch with BLAST (blast2.2.18) and BLAST+ (blast+-2.2.23). The comparison showed that BLAST and BLAST+ have almost identical sensitivity (but BLAST+ is twice as slow), so we only show the comparison with BLAST in this article (and the speedup will be even greater if we compare RAPSearch to BLAST+).
b The SRR020796 dataset was downloaded from the NCBI website (from the rumen microbiota response study), and only 2 of the reads were used for testing.
c The dataset was from the nine biomes project (Dinsdale ).
d The TS50 (accession number: 4440615.3) and TS28 (4440613.3) datasets were from the Twin Study (Turnbaugh ). The 4440037, TS50 and TS28 datasets were downloaded from the MG-RAST server (http://metagenomics.anl.gov/).
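The speedups quoted in the abstract can be checked against the running times in the table. The short script below is a sketch of that arithmetic; the running times are copied from the table, while the dictionary layout and the function name `speedups` are assumptions made for illustration.

```python
# Running times in minutes, copied from the table above.
# Tuple order: BLAST, RAPSearch, RAPSearch2 with 1, 4 and 8 threads.
TIMES = {
    ("IMG", "SRR020796"): (95800, 1170, 587, 170, 100),
    ("IMG", "4440037"): (9240, 378, 120, 36, 22),
    ("IMG", "TS28"): (67000, 3872, 1341, 331, 242),
    ("IMG", "TS50"): (39200, 4105, 1512, 385, 281),
    ("NCBI NR", "SRR020796"): (271000, 2910, 1229, 362, 250),
    ("NCBI NR", "4440037"): (25680, 889, 335, 110, 58),
    ("NCBI NR", "TS28"): (177900, 8471, 3019, 859, 518),
    ("NCBI NR", "TS50"): (103600, 9195, 3545, 901, 644),
}


def speedups(times):
    """Compute speedup ratios (old time / new time) for each database and query pair."""
    result = {}
    for key, (blast, rapsearch, t1, t4, t8) in times.items():
        result[key] = {
            "rapsearch2_vs_rapsearch": rapsearch / t1,  # both single-threaded
            "rapsearch2_vs_blast": blast / t1,
            "threads4_vs_1": t1 / t4,
            "threads8_vs_1": t1 / t8,
        }
    return result


if __name__ == "__main__":
    for (database, query), s in speedups(TIMES).items():
        print(
            f"{database:8s} {query:10s} "
            f"vs RAPSearch: {s['rapsearch2_vs_rapsearch']:.2f}x  "
            f"4 threads: {s['threads4_vs_1']:.2f}x  "
            f"8 threads: {s['threads8_vs_1']:.2f}x"
        )
```

With these numbers, single-thread RAPSearch2 is roughly 2 to 3.2 times faster than RAPSearch, and the 4-thread mode is roughly 3.0 to 4.1 times faster than the single-thread mode, in line with the figures given in the abstract. The BLAST ratios rest on running times that, per footnote a, were estimated from 1% of each query dataset.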