Shuji Suzuki1, Masanori Kakuta1, Takashi Ishida1, Yutaka Akiyama1.
Abstract
In metagenomic analyses, DNA sequences are translated into protein-coding sequences and then assigned to protein families, because of the need for sensitivity. However, the huge volume of sequence data makes even general homology searches with BLASTX difficult in terms of computational cost. We designed a new homology search algorithm that finds seed sequences based on the suffix arrays of a query and a database, and implemented it as GHOSTX. GHOSTX achieved approximately 131-165 times acceleration over a BLASTX search at similar levels of sensitivity. GHOSTX is distributed under the BSD 2-clause license and is available for download at http://www.bi.cs.titech.ac.jp/ghostx/. Sequencing technology continues to improve, and sequencers are producing ever larger quantities of data. This explosion of sequence data makes computational analysis with contemporary tools more difficult. We offer this tool as a potential solution to this problem.
Year: 2014 PMID: 25099887 PMCID: PMC4123905 DOI: 10.1371/journal.pone.0103833
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
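The abstract describes finding seeds through suffix arrays of a query and a database. The sketch below illustrates only the general idea of looking up query substrings in a database suffix array. It is not GHOSTX's seed search: the fixed seed length `k`, the exact-match criterion, the naive construction, and the function names `build_suffix_array` and `find_exact_seeds` are all assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right


def build_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of `text` in sorted order.

    Naive construction by sorting whole suffixes; suitable for toy inputs only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_exact_seeds(query: str, database: str, k: int) -> list[tuple[int, int]]:
    """Return (query_pos, db_pos) pairs whose length-k substrings match exactly.

    Hypothetical helper: fixed-length exact matching is an assumption for
    illustration, not the seed criterion used by GHOSTX.
    """
    sa = build_suffix_array(database)
    # The length-k prefixes of sorted suffixes are themselves in sorted order,
    # so they can be binary-searched directly.
    prefixes = [database[i:i + k] for i in sa]
    seeds = []
    for q in range(len(query) - k + 1):
        kmer = query[q:q + k]
        lo = bisect_left(prefixes, kmer)
        hi = bisect_right(prefixes, kmer)
        # Every suffix-array entry in [lo, hi) starts with this k-mer.
        seeds.extend((q, sa[j]) for j in range(lo, hi))
    return sorted(seeds)


if __name__ == "__main__":
    # Toy protein-like strings; real inputs would be translated reads
    # and a protein database.
    print(find_exact_seeds("MKVLA", "AMKVLMKV", k=3))
```

Binary search over the sorted suffixes locates all occurrences of a substring without scanning the whole database, which is the property that makes suffix arrays attractive for seed search.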
Figure 1. The flow of GHOSTX.
Figure 2. Seed search algorithm using suffix arrays.
Figure 3. Search algorithm using a suffix array.
Figure 4. An example seed search.
Figure 5. Conditions for reducing seeds in chain filtering.
Figure 6. Search sensitivity of each tool with KEGG GENES.
The vertical axis shows the percentage of correct answers recovered by each method. The horizontal axis shows the E-value of the alignments.
Computation time with SRS011098 and KEGG GENES (3.9 GB).
| Program | Computation time (sec.) | Acceleration ratio |
| GHOSTX | 401.9 | 152.6 |
| RAPSearch | 649.5 | 94.4 |
| RAPSearch in fast mode | 91.2 | 672.2 |
| BLAT | 1409.7 | 43.5 |
| BLAST | 61314.1 | 1.0 |
The first, second, and third columns show the name of each program, the computation time, and the acceleration in processing speed relative to BLASTX using 1 thread, respectively.
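The acceleration ratios in these tables are consistent with dividing the single-thread BLASTX time by each program's time. A minimal sketch of that arithmetic, using the times from the SRS011098 and KEGG GENES table, follows; the function name `acceleration_ratios` and the dictionary layout are assumptions for illustration.

```python
def acceleration_ratios(times_sec: dict[str, float], baseline: str) -> dict[str, float]:
    """Return baseline_time / program_time for each program, rounded to one decimal."""
    base = times_sec[baseline]
    return {name: round(base / t, 1) for name, t in times_sec.items()}


# Computation times (sec.) copied from the table above.
times = {
    "GHOSTX": 401.9,
    "RAPSearch": 649.5,
    "BLAT": 1409.7,
    "BLAST": 61314.1,
}

if __name__ == "__main__":
    for name, ratio in acceleration_ratios(times, baseline="BLAST").items():
        print(f"{name}: {ratio}")
```

Because the published times are themselves rounded, a recomputed ratio can differ from the tabulated one in the last digit. For RAPSearch in fast mode, for example, 61314.1 / 91.2 is about 672.3, against 672.2 in the table.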
Computation time with SRR444039 and KEGG GENES (3.9 GB).
| Program | Computation time (sec.) | Acceleration ratio |
| GHOSTX | 362.7 | 151.8 |
| RAPSearch | 553.2 | 99.5 |
| RAPSearch in fast mode | 64.8 | 849.6 |
| BLAT | 1265.3 | 43.5 |
| BLAST | 55045.0 | 1.0 |
The first, second, and third columns show the name of each program, the computation time, and the acceleration in processing speed relative to BLASTX using 1 thread, respectively.
Computation time with SRS011098 and NCBI nr (14.8 GB).
| Program | Computation time (sec.) | Acceleration ratio |
| GHOSTX | 1020.1 | 165.2 |
| RAPSearch | 1564.4 | 107.7 |
| RAPSearch in fast mode | 223.8 | 752.8 |
| BLAT | N/A | N/A |
| BLAST | 168488.0 | 1.0 |
The first, second, and third columns show the name of each program, the computation time, and the acceleration in processing speed relative to BLASTX using 1 thread, respectively.
Computation time with SRR444039 and NCBI nr (14.8 GB).
| Program | Computation time (sec.) | Acceleration ratio |
| GHOSTX | 1003.5 | 130.8 |
| RAPSearch | 1404.1 | 93.4 |
| RAPSearch in fast mode | 223.8 | 586.2 |
| BLAT | N/A | N/A |
| BLAST | 131213.3 | 1.0 |
The first, second, and third columns show the name of each program, the computation time, and the acceleration in processing speed relative to BLASTX using 1 thread, respectively.
Computation time of preprocessing, including indexing, with KEGG GENES (3.9 GB) and NCBI nr (14.8 GB).
| Program | Computation time with KEGG GENES (sec.) | Computation time with NCBI nr (sec.) |
| GHOSTX | 1589.2 | 4415.2 |
| RAPSearch | 1914.2 | 4210.5 |
| BLAST | 637.6 | 1678.9 |
The first, second, and third columns show the name of each program, the computation time with KEGG GENES, and the computation time with NCBI nr, respectively.
Figure 7. Computation times with multithreading.
Comparison of memory usage for KEGG GENES (3.9 GB) at each database chunk size.
| Chunk size | Memory size for constructing index (GB) | Memory size for homology search (GB) |
| 512 MB | 4.6 | 4.2 |
| 1 GB | 9.2 | 7.2 |
| 2 GB | 18.2 | 13.3 |
The first, second, and third columns show the size of the database chunk, the memory used for constructing the index (GB), and the memory used for the homology search (GB), respectively.
Comparison of computation time for KEGG GENES (3.9 GB) at each database chunk size.
| Chunk size | Computation time (sec.) | Acceleration ratio |
| 512 MB | 526.9 | 0.8 |
| 1 GB | 452.7 | 0.9 |
| 2 GB | 401.9 | 1.0 |
The first, second, and third columns show the size of the database chunk, the computation time, and the acceleration in processing speed relative to GHOSTX with 2 GB database chunks, respectively.