Kazuki Takabatake, Kazuki Izawa, Motohiro Akikawa, Keisuke Yanagisawa, Masahito Ohue, Yutaka Akiyama.
Abstract
Metagenomic analysis, a technique used to comprehensively analyze microorganisms present in the environment, requires performing high-precision homology searches on large amounts of sequencing data, the size of which has increased dramatically with the development of next-generation sequencing. NCBI BLAST is the most widely used software for performing homology searches, but its speed is insufficient for the throughput of current DNA sequencers. In this paper, we propose a new, high-performance homology search algorithm that employs a two-step seed search strategy using multiple reduced amino acid alphabets to identify highly similar subsequences. Additionally, we evaluated the validity of the proposed method against several existing tools. Our method was faster than any other existing program for ≤120,000 queries, while DIAMOND, an existing tool, was the fastest method for >120,000 queries.
Keywords: genome sequence; homology search; metagenomic analysis; reduced amino acid
Year: 2021 PMID: 34573438 PMCID: PMC8469100 DOI: 10.3390/genes12091455
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Figure 1. Reduced amino acid alphabet generated using the method proposed by Murphy et al. [9].
Figure 2. Seed search steps of TSSS.
Figure 3. TSSS flowchart.
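
The idea of matching seeds over a reduced amino acid alphabet with a bounded Hamming distance can be sketched as follows. This is a minimal illustration, not the TSSS implementation: the 10-group clustering below is an assumed example in the style of Murphy et al., and the helper names (`build_reduction`, `reduce_sequence`, `seeds_match`) are hypothetical. The groupings and seed logic actually used by TSSS are not given in this record.

```python
# Assumption: an illustrative 10-group clustering of the 20 amino acids.
# The alphabets actually used by TSSS may differ.
GROUPS_10 = ["LVIM", "C", "A", "G", "ST", "P", "FYW", "EDNQ", "KR", "H"]


def build_reduction(groups):
    """Map every amino acid to the first letter of its group."""
    return {aa: group[0] for group in groups for aa in group}


def reduce_sequence(seq, reduction):
    """Rewrite a protein sequence in the reduced alphabet."""
    return "".join(reduction[aa] for aa in seq)


def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def seeds_match(query_seed, subject_seed, reduction, max_hamming):
    """Return True if the reduced seeds differ in at most max_hamming positions."""
    q = reduce_sequence(query_seed, reduction)
    s = reduce_sequence(subject_seed, reduction)
    return hamming(q, s) <= max_hamming


reduction = build_reduction(GROUPS_10)
print(seeds_match("LKSE", "IRTD", reduction, 0))  # all four residues share a group
print(seeds_match("LKSE", "IRTW", reduction, 0))  # last residue falls in another group
```

A smaller alphabet makes more residue pairs count as equal, so seeds match more often; allowing a nonzero Hamming distance loosens the match further.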
Options for comparison programs.
| Program | Options |
|---|---|
| BLAST | -outfmt 6 -comp_based_stats 0 -seg no |
| RAPSearch2 fast | -b 0 -t n -a t |
| RAPSearch2 | -b 0 -t n |
| GHOSTZ | -q d -F |
| DIAMOND | -f 6 -e 10 -p 1 --masking 0 --comp-based-stats 0 |
| DIAMOND-sensitive | -f 6 -e 10 -p 1 --sensitive --masking 0 --comp-based-stats 0 |
| DIAMOND-more sensitive | -f 6 -e 10 -p 1 --more-sensitive --masking 0 --comp-based-stats 0 |
Reduced amino acid alphabets used in comparison programs.
| Program | Size of Reduced Amino Acid Alphabet |
|---|---|
| RAPSearch2 | 10 |
| GHOSTZ | 10 |
| DIAMOND | 11 |
TSSS parameter ranges.
| Parameter | Description | Range |
|---|---|---|
| (H1, H2) | Hamming distances allowed for | {(0, 0), (0, 1), (1, 1)} |
| L1 | Length of | {2, 4, 6, 8} |
| A1 | Size of reduced amino acid alphabet for | {6, 8, 10, 12, 14, 16, 18} |
| L2 | Length of | {2, 3, 4, 5} |
| A2 | Size of reduced amino acid alphabet for | {4, 6, 8, 10} |
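
To give a sense of how many configurations the ranges above span, the sketch below enumerates them. It assumes the evaluated search space is the full Cartesian product of the listed ranges, which this record does not confirm. The function name `parameter_grid` and the dictionary keys are hypothetical.

```python
from itertools import product

# Ranges copied from the TSSS parameter table.
HAMMING_PAIRS = [(0, 0), (0, 1), (1, 1)]
L1_VALUES = [2, 4, 6, 8]
A1_VALUES = [6, 8, 10, 12, 14, 16, 18]
L2_VALUES = [2, 3, 4, 5]
A2_VALUES = [4, 6, 8, 10]


def parameter_grid():
    """Yield every combination of the listed ranges (assumed full product)."""
    for (h1, h2), l1, a1, l2, a2 in product(
        HAMMING_PAIRS, L1_VALUES, A1_VALUES, L2_VALUES, A2_VALUES
    ):
        yield {"H1": h1, "H2": h2, "L1": l1, "A1": a1, "L2": l2, "A2": a2}


grid = list(parameter_grid())
print(len(grid))  # 3 * 4 * 7 * 4 * 4 combinations
```

Under that assumption the grid holds 3 × 4 × 7 × 4 × 4 = 1344 parameter sets.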
Figure 4. TSSS results.
Figure 5. Accuracy and CPU time for each method.
Parameter details of representative TSSS methods.
| Name | H1 | H2 | L1 | A1 | L2 | A2 |
|---|---|---|---|---|---|---|
| Fast | 0 | 1 | 4 | 18 | 5 | 6 |
| Middle | 0 | 1 | 4 | 16 | 4 | 8 |
| Sensitive | 0 | 1 | 2 | 18 | 5 | 8 |
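
One way to compare the three presets is by the number of distinct seeds each step can produce: a seed of length L over an alphabet of size A has A^L possible values. The sketch below computes this for each preset, reading (L1, A1) and (L2, A2) as the seed length and alphabet size of the two seed search steps. The `TsssPreset` class is a hypothetical container for illustration, not part of TSSS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TsssPreset:
    """Hypothetical container for one row of the preset table."""
    name: str
    h1: int
    h2: int
    l1: int
    a1: int
    l2: int
    a2: int

    def seed_space_sizes(self):
        """Return (A1**L1, A2**L2): the number of distinct seeds per step."""
        return self.a1 ** self.l1, self.a2 ** self.l2


# Values copied from the preset table.
PRESETS = [
    TsssPreset("Fast", 0, 1, 4, 18, 5, 6),
    TsssPreset("Middle", 0, 1, 4, 16, 4, 8),
    TsssPreset("Sensitive", 0, 1, 2, 18, 5, 8),
]

for preset in PRESETS:
    first, second = preset.seed_space_sizes()
    print(f"{preset.name}: {first} seeds (A1^L1), {second} seeds (A2^L2)")
```

A smaller seed space means each seed value is shared by more positions, so more candidate hits are generated. This is consistent with the Sensitive preset, whose first pair gives only 18^2 = 324 distinct seeds, being the slowest of the three.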
Figure 6. Accuracy of each method according to E-value.
Figure 7. CPU time according to number of queries for each program.
Execution speed ratio against NCBI BLAST.
| Program | Speed Ratio |
|---|---|
| BLAST | 1.0 |
| RAPSearch2 fast | 890.2 |
| DIAMOND | 6122.7 |
| TSSS fast | 1241.9 |
| RAPSearch2 | 65.2 |
| GHOSTZ | 121.2 |
| TSSS middle | 336.1 |
| DIAMOND-sensitive | 731.4 |
| DIAMOND-more sensitive | 347.0 |
| TSSS sensitive | 148.0 |
Figure 8. Speed-up ratio (A) and memory consumption (B) for parallel execution.