Matija Korpar, Martin Šošić, Dino Blažeka, Mile Šikić.
Abstract
In recent years we have witnessed a growth in sequencing yield, in the number of samples sequenced and, as a result, in the size of publicly maintained sequence databases. This increase in data has placed high demands on protein similarity search algorithms, which must balance two opposing goals: keeping running times acceptable while maintaining a sufficiently high level of sensitivity. The most time-consuming step of similarity search is the local alignment between query and database sequences. This step is usually performed using exact local alignment algorithms such as Smith-Waterman. Because of its quadratic time complexity, aligning a query to the whole database is usually too slow. Therefore, most protein similarity search methods apply heuristics before the exact local alignment to reduce the number of candidate sequences in the database. However, there is still a need to align a query sequence to the reduced database. In this paper we present SW#db, a tool and library for fast exact similarity search. Although its running times as a standalone tool are comparable to those of BLAST, it is primarily intended for the exact local alignment phase, in which the database of sequences has already been reduced. It uses both GPU and CPU parallelization and, at the time of writing, was 4-5 times faster than SSEARCH, 6-25 times faster than CUDASW++ and more than 20 times faster than SSW, using multiple queries on the Swiss-prot and Uniref90 databases.
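To illustrate the quadratic cost mentioned in the abstract, here is a minimal Python sketch of Smith-Waterman scoring. It is not the SW#db implementation: the function name `smith_waterman_score` is hypothetical, and the match/mismatch scores and linear gap penalty are simplifying assumptions (protein search tools normally use a substitution matrix such as BLOSUM and affine gap penalties). It fills every cell of a `len(query) × len(target)` table, which is where the quadratic running time comes from.

```python
def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between two sequences.

    Assumptions (not from the paper): fixed match/mismatch scores and a
    linear gap penalty. Only the previous row is kept in memory.
    """
    prev = [0] * (len(target) + 1)  # row i-1 of the dynamic programming table
    best = 0
    for qc in query:
        curr = [0]  # column 0 is always 0 for local alignment
        for j, tc in enumerate(target, start=1):
            diag = prev[j - 1] + (match if qc == tc else mismatch)
            up = prev[j] + gap        # gap in the target
            left = curr[j - 1] + gap  # gap in the query
            cell = max(0, diag, up, left)  # 0 restarts the local alignment
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    # Two short protein-like strings, chosen only for illustration.
    print(smith_waterman_score("HEAGAWGHEE", "PAWHEAE"))
```

Every query residue is compared against every target residue, so searching one query against a database costs time proportional to the query length times the total database length. This is why heuristics are commonly used to shrink the candidate set first.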
Year: 2015 PMID: 26719890 PMCID: PMC4699916 DOI: 10.1371/journal.pone.0145857
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. GPU long kernel execution.
Each thread in the SW#db long kernel solves four rows using optimized CUDA structures.
Fig 2. Database processing steps.
Sequences in the database are sorted by length and divided into two partitions. In both partitions, the GPU kernels (short and long) process sequences from shorter to longer, while OPAL (the CPU implementation) processes them in the opposite direction.
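The two-ended processing described in this caption can be sketched as a small simulation. This is only an illustration under stated assumptions, not SW#db code: the function `two_ended_schedule`, the relative speeds `gpu_speed` and `cpu_speed`, and the rule that whichever worker is free first takes the next sequence are all hypothetical. The sketch covers a single length-sorted partition, and the criterion for splitting the database into two partitions is not given here, so it is left out.

```python
from collections import deque


def two_ended_schedule(lengths, gpu_speed=4.0, cpu_speed=1.0):
    """Simulate a GPU and a CPU worker consuming one length-sorted
    partition from opposite ends. Returns (gpu_items, cpu_items).

    The speeds are hypothetical "residues per time unit" values used
    only to decide which worker becomes free first.
    """
    queue = deque(sorted(lengths))
    gpu_items, cpu_items = [], []
    gpu_time = cpu_time = 0.0
    while queue:
        if gpu_time <= cpu_time:
            length = queue.popleft()  # GPU side: shortest remaining sequence
            gpu_items.append(length)
            gpu_time += length / gpu_speed
        else:
            length = queue.pop()      # CPU side: longest remaining sequence
            cpu_items.append(length)
            cpu_time += length / cpu_speed
    return gpu_items, cpu_items


if __name__ == "__main__":
    gpu, cpu = two_ended_schedule([110, 195, 299, 390, 513, 607])
    print("GPU (short to long):", gpu)
    print("CPU (long to short):", cpu)
```

Working from opposite ends means the two workers never need to agree on a split point in advance: they simply stop when the queue between them is empty.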
Comparison of running times for SW#db, BLASTP, CUDASW++ v2.0, SSW and SSEARCH using the ASTRAL database as the query file.
We could not run CUDASW++ 3.1 on either machine (segmentation fault). Neither version of CUDASW++ could run on Uniref90 because of the size of the database. We did not measure the running time of SSW on Uniref90 because it would have taken too long.
| Database | Configuration | SW#db (s) | BLASTP (s) | SSEARCH (s) | CUDASW++ v2.0 (s) | SSW (s) |
|---|---|---|---|---|---|---|
| Swiss-prot | Single-GPU; Nvidia GeForce GTX 780 card | 3523 | 3494 | 15123 | 23795 | 87118 |
| Uniref90 | Single-GPU; Nvidia GeForce GTX 780 card | 123581 | 73117 | 490543 | - | - |
| Swiss-prot | Multi-GPU; 2×Nvidia Tesla K80 cards | 1264 | 2210 | 6063 | 30174 | - |
| Uniref90 | Multi-GPU; 2×Nvidia Tesla K80 cards | 41019 | 29597 | 164188 | - | - |
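The speedup figures quoted in the abstract can be checked against the table above with a short calculation. The sketch below copies the running times from the table; the dictionary layout and the helper `speedups` are illustrative choices, not part of SW#db. A ratio above 1 means the other tool took longer than SW#db.

```python
# Running times in seconds, copied from the table above (None = not measured).
TIMES = {
    ("Swiss-prot", "single-GPU"): {
        "SW#db": 3523, "BLASTP": 3494, "SSEARCH": 15123,
        "CUDASW++ v2.0": 23795, "SSW": 87118,
    },
    ("Uniref90", "single-GPU"): {
        "SW#db": 123581, "BLASTP": 73117, "SSEARCH": 490543,
        "CUDASW++ v2.0": None, "SSW": None,
    },
    ("Swiss-prot", "multi-GPU"): {
        "SW#db": 1264, "BLASTP": 2210, "SSEARCH": 6063,
        "CUDASW++ v2.0": 30174, "SSW": None,
    },
    ("Uniref90", "multi-GPU"): {
        "SW#db": 41019, "BLASTP": 29597, "SSEARCH": 164188,
        "CUDASW++ v2.0": None, "SSW": None,
    },
}


def speedups(row, baseline="SW#db"):
    """Return each tool's running time divided by the baseline's."""
    base = row[baseline]
    return {
        tool: round(seconds / base, 2)
        for tool, seconds in row.items()
        if tool != baseline and seconds is not None
    }


if __name__ == "__main__":
    for (database, config), row in TIMES.items():
        print(database, config, speedups(row))
```

On these numbers the SSEARCH ratio stays between roughly 4 and 5, the CUDASW++ v2.0 ratio ranges from about 7 to about 24, and the single SSW measurement gives a ratio of about 25. BLASTP, a heuristic method, is faster than SW#db on Uniref90 in both configurations.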
Fig 5. Comparison of sensitivity of BLASTP and SW#db on the Astral/SCOP database.
The list of Uniprot IDs and lengths of proteins used in performance testing.
| Uniprot ID | Length (residues) |
|---|---|
| O74807 | 110 |
| P19930 | 195 |
| B8E1A7 | 299 |
| Q3ZAI3 | 390 |
| P18080 | 513 |
| O84416 | 607 |
| A9BIH4 | 727 |
| Q2LR26 | 804 |
| B4KLY7 | 980 |
| Q5R7Y0 | 1465 |
| Q700K0 | 5124 |
| P0C6V8 | 6733 |
| P0C6W9 | 7094 |
| O01761 | 8081 |
| Q6GGX3 | 10746 |
| Q9I7U4 | 18141 |
| Q8WXI7 | 22152 |
| Q3ASY8 | 36805 |