Won Cheol Yim, John C Cushman.
Abstract
Bioinformatics is currently faced with very large-scale data sets that lead to computational jobs, especially sequence similarity searches, that can take impractically long times to run. For example, the National Center for Biotechnology Information (NCBI) Basic Local Alignment Search Tool (BLAST and BLAST+) suite, which is by far the most widely used tool for rapid similarity searching among nucleic acid or amino acid sequences, is highly central processing unit (CPU) intensive. While the BLAST suite of programs performs searches very rapidly, it has the potential to be accelerated. In recent years, distributed computing environments have become more widely accessible and used due to the increasing availability of high-performance computing (HPC) systems. Therefore, simple solutions for data parallelization are needed to expedite BLAST and other sequence analysis tools. However, existing software for parallel sequence similarity searches often requires extensive computational experience and skill on the part of the user. In order to accelerate BLAST and other sequence analysis tools, Divide and Conquer BLAST (DCBLAST) was developed to perform NCBI BLAST searches within a cluster, grid, or HPC environment by using a query sequence distribution approach. Scaling from 1 to 256 CPU cores resulted in significant improvements in processing speed. Thus, DCBLAST dramatically accelerates the execution of BLAST searches using a simple, accessible, robust, and parallel approach. DCBLAST works across multiple nodes automatically, and it overcomes the speed limitation of single-node BLAST programs. DCBLAST can be used on any HPC system, can take advantage of hundreds of nodes, and has no output limitations. This freely available tool simplifies distributed computation pipelines to facilitate the rapid discovery of sequence similarities between very large data sets.
Keywords: BLAST; Distributed computing; Environment; HPC; Parallel processing; Sequence similarity
Year: 2017 PMID: 28652936 PMCID: PMC5483034 DOI: 10.7717/peerj.3486
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Figure 1. The workflow of DCBLAST.
Sequence query involves submission of a single FASTA file containing multiple sequences. The sequences are then subdivided into multiple sequence query FASTA files to achieve load balancing across multiple computer nodes. After BLAST/BLAST+ searching has been completed, the results from each search are merged into a single output file.
Figure 2. Pseudocode for the DCBLAST algorithm to perform query subdividing.
The multiple query sequences (in one FASTA file) are subdivided into a set of N query files: sequences are added to the current file until its cumulative sequence length exceeds the total query length divided by N, at which point a new file is started. Once the queries are subdivided, the program submits the individual files to the HPC scheduler, and BLAST/BLAST+ is run on each. Lastly, the BLAST/BLAST+ output files are merged into a single report file that is returned to the user.
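The subdividing step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the DCBLAST implementation (which is written in Perl): the function names `parse_fasta` and `subdivide` are hypothetical, a chunk is closed once its cumulative length reaches the total length divided by N, and any remaining sequences go into the final chunk so that no more than N chunks are produced.

```python
from typing import Iterable, Iterator, List, Tuple

# A record is (header without the leading '>', concatenated sequence).
Record = Tuple[str, str]


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from the lines of a FASTA file."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def subdivide(records: List[Record], n: int) -> List[List[Record]]:
    """Split records into at most n chunks of roughly equal total length.

    Assumption: a chunk is closed once its cumulative sequence length
    reaches total_length / n; the last chunk takes whatever remains.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    total = sum(len(seq) for _, seq in records)
    target = total / n
    parts: List[List[Record]] = []
    current: List[Record] = []
    current_len = 0
    for rec in records:
        current.append(rec)
        current_len += len(rec[1])
        # Only the first n - 1 chunks are closed early.
        if current_len >= target and len(parts) < n - 1:
            parts.append(current)
            current = []
            current_len = 0
    if current:
        parts.append(current)
    return parts
```

Each returned chunk would then be written to its own FASTA file and submitted as a separate scheduler job. Because whole sequences are never split, chunk lengths are only approximately balanced when individual sequences differ widely in length.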
Figure 3. Scaling performance of DCBLAST with Arabidopsis query transcripts versus the UniProtKB/Swiss-Prot protein database.
Speed benchmarks shown include processing times in seconds and fold increases in performance when using 1, 8, 16, 32, 64, 128, and 256 CPU cores.
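For readers who want to reproduce this kind of benchmark summary, the fold increase is conventionally the single-core time divided by the time on N cores, and dividing that by N gives the parallel efficiency. The sketch below assumes that definition; the function names are hypothetical, and the timings in the usage example are placeholders rather than values from the paper.

```python
def fold_increase(baseline_seconds: float, seconds: float) -> float:
    """Speed-up relative to the baseline (for example, single-core) run."""
    if baseline_seconds <= 0 or seconds <= 0:
        raise ValueError("timings must be positive")
    return baseline_seconds / seconds


def parallel_efficiency(baseline_seconds: float, seconds: float, cores: int) -> float:
    """Fraction of ideal linear scaling achieved on the given core count."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return fold_increase(baseline_seconds, seconds) / cores


# Placeholder timings in seconds, NOT measurements from the paper.
example_timings = {1: 1000.0, 8: 125.0}
for cores, seconds in example_timings.items():
    fold = fold_increase(example_timings[1], seconds)
    eff = parallel_efficiency(example_timings[1], seconds, cores)
    print(f"{cores} cores: {fold:.1f}x, efficiency {eff:.2f}")
```

An efficiency near 1 indicates close to linear scaling; values that fall as the core count grows usually reflect scheduling overhead or uneven chunk sizes.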
Comparison of the features of DCBLAST with those of existing parallel bioinformatics software for the performance of BLAST/BLAST+ searches.
| Features | DCBLAST | HPC-BLAST | GPU-BLAST | PLAST | mpiBLAST | ScalaBLAST |
|---|---|---|---|---|---|---|
| Parallelization algorithm | MapReduce | MPI | SIMT | Ordered Index Seed | MPI | MPI |
| Hardware requirement | HPC | HPC environment (Xeon & Xeon phi) | NVIDIA | SIMD | HPC environment | HPC environment |
| Prerequisites | Sun Grid Engine, Perl (any version 5), Path::Tiny (Perl module), Data::Dumper (Perl module), Config::Tiny (Perl module) | Intel MPI C/C++ compiler, xild (Intel linker), xiar (Intel archiving) | CUDA | GCC v4.4+ , cmake 2.8+ | mvapich2 v1.4.1 or mvapich2 v1.4.1 2 or mvapich v1.2.0 3 or OpenMPI v1.4.1 or Intel MPI C/C++ compiler | Intel C/C++ compiler, OpenMPI |
| Scalable across multiple threads | Yes | Yes | Yes (GPU) | Yes | Yes | No |
| Scalable across multiple nodes | Yes | Yes | Not applicable | No | Yes | Yes |
| Supported BLAST version | All versions of NCBI-BLAST+ | All versions of NCBI-BLAST+ | Not applicable | Not applicable | NCBI-BLAST 2.2.20 | NCBI-BLAST 1.1.1.1 |
| Bibliography reference | This report | | | | | |
| Last update | 4/18/17 | 08/25/16 | 02/09/16 | 04/21/16 | 11/28/2012 | 08/12/13 |
| Limitations | None | Only BLASTN and BLASTP | Only BLASTP | Single node only; results similar to NCBI-BLAST | Limited output formats; older version of BLAST | Older version of BLAST |
Notes.
MPI, Message Passing Interface.
SIMT, Single-Instruction Multiple-Thread.
HPC, High Performance Computing.
NVIDIA, NVIDIA Corporation.
GPU, Graphics Processing Unit.
SIMD, Single Instruction Multiple Data.
SSE, Streaming SIMD Extensions.
CUDA, Compute Unified Device Architecture.
GCC, GNU Compiler Collection.