Machbah Uddin1,2, Mohammad Khairul Islam1, Md Rakib Hassan2, Farah Jahan1, Joong Hwan Baek3.
Abstract
DNA sequence similarity analysis is needed for many purposes, including genome analysis, extracting biological information, and finding the evolutionary relationships of species. Sequence analysis methods fall into two types: alignment-based (AB) and alignment-free (AF). AB is effective for small homologous sequences but becomes an NP-hard problem for long sequences. AF algorithms can overcome the major limitations of AB, but most existing AF methods show high time complexity and memory consumption, lower precision, and weaker performance on benchmark datasets. To reduce these limitations, we develop an AF algorithm using a 2D k-mer count matrix inspired by the chaos game representation (CGR) approach. We then shrink the matrix by analyzing the neighbors and measure similarities using the best combinations of pairwise distance (PD) and phylogenetic tree methods. We also choose the value of k for the k-mer dynamically, and we develop an efficient system for finding the positions of k-mers in the count matrix. We apply our system to six different datasets. We achieve the top rank on two benchmark datasets from AFproject and 100% accuracy on two datasets (16 S Ribosomal, 18 Eutherian), and we reduce time and memory consumption relative to existing studies on the HEV and HIV-1 datasets. The comparative results on the benchmark datasets and against existing studies therefore demonstrate that our method is effective, efficient, and accurate, and that it can be used with confidence for DNA sequence similarity measurement.
Keywords:
AFproject; Benchmark dataset; Bioinformatics engineering; DNA sequence similarity; Dynamic k -
Year: 2022 PMID: 36035628 PMCID: PMC9395857 DOI: 10.1007/s40747-022-00846-y
Source DB: PubMed Journal: Complex Intell Systems ISSN: 2199-4536
Fig. 1 Overview of our proposed DNA sequence similarity identification model
Fig. 2 Two-dimensional count matrix in which each cell holds the count of a specific substring in the whole string: (a) 4 count cells for substrings of one base; (b) 16 count cells for substrings of two bases; (c) 64 count cells, where each substring comprises 3 bases, or a codon; (d) general expansion for substrings of length k, which gives 4^k count cells. The four red cells in (b) are expanded from the one red cell in (a), and the 16 red cells in (c) are expanded from the red cells of (b)
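The expansion in Fig. 2 can be sketched in a few lines of Python. This is a minimal illustration, assuming a CGR-style layout in which each base contributes one bit to the row index and one bit to the column index; the `QUADRANT` assignment and the function names `kmer_position` and `count_matrix` are hypothetical and are not taken from the paper.

```python
# Assumed quadrant layout (illustrative only, not taken from the paper):
# each base contributes one bit to the row index and one bit to the column index.
QUADRANT = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}


def kmer_position(kmer):
    """Return the (row, column) cell of a k-mer in a 2^k x 2^k matrix."""
    row = col = 0
    for base in kmer:
        r, c = QUADRANT[base]
        row = (row << 1) | r  # earlier bases select the coarser quadrant
        col = (col << 1) | c
    return row, col


def count_matrix(seq, k):
    """Count every overlapping k-mer of seq into a 2^k x 2^k grid (4^k cells)."""
    size = 2 ** k
    matrix = [[0] * size for _ in range(size)]
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Skip windows that contain ambiguous symbols such as N.
        if all(base in QUADRANT for base in kmer):
            row, col = kmer_position(kmer)
            matrix[row][col] += 1
    return matrix
```

Under this assumed layout, the position of a k-mer is computed directly from its bases in k steps, with no search through the matrix, and the four cells that share the same first k-1 bases form one 2 x 2 block, which mirrors the one-cell-to-four-cells expansion described in the caption.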
Table 1 Description of 25 cichlid fish genome sequences
| SL | Description | Accession | Seq. Length |
|---|---|---|---|
| 1 | Tropheus duboisi | 009063 | 16,747 |
| 2 | Tropheus moorii | 018814 | 16,826 |
| 3 | Petrochromis trewavasae | 018815 | 16,828 |
| 4 | Neolamprologus brichardi | 009062 | 16,823 |
| 5 | Oreochromis aureus | 013750 | 16,867 |
| 6 | Oreochromis niloticus | 013663 | 16,866 |
| 7 | Oreochromis sp. KM_2006 | 009057 | 16,865 |
| 8 | Tanganyika Tylochromis polylepis | 011171 | 17,118 |
| 9 | Hypselecara temporalis | 011168 | 16,782 |
| 10 | Astronotus ocellatus | 009058 | 16,807 |
| 11 | Ptychochromoides katria | 011169 | 16,794 |
| 12 | Paratilapia polleni | 011170 | 16,760 |
| 13 | Paretroplus maculatus | 011177 | 16,723 |
| 14 | Etroplus maculatus | 011179 | 16,693 |
| 15 | Abudefduf vaigiensis | 009064 | 16,943 |
| 16 | Amphiprion ocellaris | 009065 | 16,888 |
| 17 | Cymatogaster aggregata | 009059 | 16,771 |
| 18 | Ditrema temminckii | 009060 | 16,810 |
| 19 | Pseudolabrus eoethinus | 012055 | 16,745 |
| 20 | Pseudolabrus sieboldi | 009067 | 16,747 |
| 21 | Pteragogus flagellifer | 010205 | 17,034 |
| 22 | Halichoeres melanurus | 009066 | 17,039 |
| 23 | Parajulis poecilepterus | 009459 | 16,896 |
| 24 | Alepocephalus agassizii | 013564 | 16,677 |
| 25 | Bajacalifornia megalops | 013577 | 17,290 |
Table 2 Description of 8 Yersinia strains
| SL | Description | Accession | Seq. Length |
|---|---|---|---|
| 1 | Y. pestis Antiqua | CP000308 | 4,702,289 |
| 2 | Y. pestis Nepal516 | CP000305 | 4,534,590 |
| 3 | Y. pestis F_15-70 | NC009381 | 4,517,345 |
| 4 | Y. pestis CO92 | AL590842 | 4,653,728 |
| 5 | Y. pestis KIM | AE009952 | 4,600,755 |
| 6 | Y. pestis 91001 | AE017042 | 4,595,065 |
| 7 | Y. pseudotuberculosis IP32954 | BX936398 | 4,744,671 |
| 8 | Y. pseudotuberculosis IP31758 | AAKT02000001 | 4,721,828 |
RF distances for different distance methods and phylogenetic tree generation methods for the Fish dataset in Table 1
| Distance method | Seqlinkage: Average | Seqlinkage: Single | Seqlinkage: Complete | Seqlinkage: Weighted | Seqlinkage: Centroid | Seqlinkage: Median | Seqneighjoin: Equivar | Seqneighjoin: Firstorder |
|---|---|---|---|---|---|---|---|---|
| Euclidean | 18 | 22 | 20 | 14 | 34 | 34 | 6 | |
| Squaredeuclidean | 18 | 22 | 20 | 14 | 34 | 32 | 4 | 8 |
| Seuclidean | 44 | 44 | 44 | 44 | 44 | 44 | 8 | 8 |
| Cityblock | 8 | 16 | 8 | 10 | 28 | 26 | 4 | 4 |
| Minkowski | 18 | 20 | 20 | 14 | 34 | 34 | 6 | |
| Chebychev | 38 | 40 | 38 | 38 | 40 | 40 | 36 | 36 |
| Cosine | 8 | 6 | 16 | 16 | 20 | 18 | ||
| Correlation | 14 | 20 | 16 | 16 | 32 | 30 | 10 | |
| Hamming | 8 | 10 | 10 | 10 | 10 | 30 | 4 | 4 |
| Jaccard | 8 | 16 | 8 | 8 | 30 | 24 | 4 | 4 |
| Spearman | 6 | 14 | 8 | 8 | 26 | 24 | 4 | 4 |
Note: The first column indicates the method used for PD calculation from feature vectors; columns 2–9 give the RF distance achieved by different phylogenetic tree generation techniques (columns 2–7 under seqlinkage, columns 8–9 under seqneighjoin). Here, (*) indicates the top result
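For readers unfamiliar with the RF (Robinson-Foulds) distance reported in these tables, the following is a minimal sketch of how it can be computed for two trees over the same taxa: each internal edge splits the taxa into two groups, and the distance counts the splits present in one tree but not in the other. The nested-tuple tree encoding and the function names are assumptions made for illustration; they are not the tooling used in the paper.

```python
def leaf_set(tree):
    """Collect the leaf names of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Return the non-trivial bipartitions of the tree, viewed as unrooted."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)  # fixed reference leaf used to name each split
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        # Represent each split by the side that does not contain the anchor.
        side = clade if anchor not in clade else all_leaves - clade
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return clade

    visit(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Number of splits found in exactly one of the two trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("both trees must contain the same taxa")
    return len(splits(tree_a) ^ splits(tree_b))
```

For example, `rf_distance(((("A", "B"), "C"), ("D", "E")), ((("A", "C"), "B"), ("D", "E")))` compares two five-taxon trees that disagree on one grouping in each direction. An RF distance of 0 means the two trees have identical topology.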
Selection of k and Sr using four datasets
| Dataset | k | RF distance (columns 3–7) | | | | |
|---|---|---|---|---|---|---|
| Fish (Table 1) | 8 | 2 | 8 | 12 | 12 | |
| Fish (Table 1) | 9 | 2 | 2 | 4 | 6 | 10 |
| Fish (Table 1) | 10 | 2 | 2 | 4 | 8 | 12 |
| Yersinia (Table 2) | 8 | 2 | 2 | 2 | 6 | 10 |
| Yersinia (Table 2) | 9 | 0 | 2 | 2 | 6 | |
| Yersinia (Table 2) | 10 | 0 | 0 | 2 | 4 | 6 |
| 16 S Ribosomal | 8 | 0 | 4 | 6 | 12 | |
| 16 S Ribosomal | 9 | 0 | 0 | 4 | 10 | 16 |
| 16 S Ribosomal | 10 | 0 | 0 | 4 | 10 | 18 |
| 18 Eutherian | 8 | 0 | 0 | 2 | 10 | |
| 18 Eutherian | 9 | 0 | 0 | 2 | 6 | 6 |
| 18 Eutherian | 10 | 0 | 0 | 2 | 4 | 10 |
Note: Column 1 gives the dataset used in the experiment, column 2 gives the k value used for 2D matrix generation, and columns 3–7 show the RF distance obtained using different Sr values. Here, the (*) sign indicates the best combinations of k and Sr
RF distances for different distance methods and phylogenetic tree generation methods for the Yersinia dataset in Table 2
| Distance method | Seqlinkage: Average | Seqlinkage: Single | Seqlinkage: Complete | Seqlinkage: Weighted | Seqlinkage: Centroid | Seqlinkage: Median | Seqneighjoin: Equivar | Seqneighjoin: Firstorder |
|---|---|---|---|---|---|---|---|---|
| Euclidean | 6 | 4 | 6 | 6 | 8 | 8 | 2 | 2 |
| Squaredeuclidean | 6 | 4 | 6 | 6 | 6 | 6 | ||
| Seuclidean | 10 | 10 | 10 | 10 | 10 | 10 | ||
| Cityblock | 6 | 4 | 6 | 6 | 8 | 8 | 2 | |
| Minkowski | 6 | 4 | 6 | 6 | 8 | 8 | 2 | 2 |
| Chebychev | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
| Cosine | 6 | 4 | 6 | 6 | 6 | 6 | ||
| Correlation | 6 | 4 | 6 | 6 | 6 | 6 | ||
| Hamming | 4 | 4 | 6 | 4 | 8 | 8 | 4 | 4 |
| Jaccard | 4 | 4 | 6 | 4 | 8 | 8 | 4 | 4 |
| Spearman | 4 | 4 | 6 | 4 | 6 | 6 | ||
Note: The first column indicates the method used for PD calculation from feature vectors; columns 2–9 give the RF distance achieved by different phylogenetic tree generation techniques. Columns 2–7 show the distance for the seqlinkage technique, columns 8–9 for the seqneighjoin technique. Here, (*) indicates top results
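The PD step named in the note turns one feature vector per sequence into a symmetric matrix of pairwise distances, which a tree-building method then consumes. A minimal sketch of that step follows, covering three of the listed metrics (cityblock, euclidean, cosine). The function names are hypothetical, and the feature vectors are assumed to be flattened count matrices; the paper's own implementation is not shown in this record.

```python
import math


def cityblock(u, v):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean(u, v):
    """Straight-line (L2) distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine(u, v):
    """One minus the cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def pairwise(vectors, metric):
    """Build the symmetric distance matrix for a list of feature vectors."""
    n = len(vectors)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = metric(vectors[i], vectors[j])
    return dist
```

Calling `pairwise(vectors, cityblock)` on the per-sequence vectors yields the matrix that a linkage or neighbor-joining routine would take as input. Swapping the `metric` argument is all that is needed to compare distance methods, which is the comparison the table rows report.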
Benchmark test results for the dataset of 25 complete mitochondrial DNA sequences of cichlid fishes on the AFproject test platform
| Rank | Method | RF | Accuracy |
|---|---|---|---|
| 1 | 8KMERHist+LBP | 2.00 | 95 |
| 1 | AFKS–d2_star | 2.00 | 95 |
| 1 | AFKS–d2z | 2.00 | 95 |
| 1 | AFKS–euclidean_z | 2.00 | 95 |
| 1 | AFKS–n2r | 2.00 | 95 |
Here, we present the top 5 of around 100 methods. Bold and the (*) sign mark the performance of our method
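The RF and Accuracy columns appear to be linked through the normalized RF distance. The sketch below assumes that accuracy is 100 × (1 − RF / RF_max), where RF_max = 2(n − 3) is the largest possible RF distance between two unrooted binary trees on n taxa. This relation is an assumption that reproduces the values in the benchmark tables; it is not quoted from the AFproject documentation, and the function names are hypothetical.

```python
def max_rf(n_taxa):
    """Largest RF distance between two unrooted binary trees on n_taxa leaves."""
    if n_taxa < 4:
        raise ValueError("at least 4 taxa are needed for a non-trivial split")
    return 2 * (n_taxa - 3)


def rf_accuracy(rf, n_taxa):
    """Assumed accuracy score: 100 * (1 - normalized RF distance)."""
    return 100.0 * (1.0 - rf / max_rf(n_taxa))
```

With 25 fish sequences, `max_rf(25)` is 44 and `rf_accuracy(2, 25)` is about 95.45, consistent with the RF of 2.00 and accuracy of 95 listed for the top-ranked methods. With 8 Yersinia strains, `max_rf(8)` is 10 and an RF of 0 gives 100. The values 44 and 10 are also the largest RF distances that appear in the Fish and Yersinia RF tables.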
Fig. 3 Phylogenetic tree of 25 fish genome sequences described in Table 1, using our proposed method
Benchmark test results for the 8 Yersinia strains dataset on the AFproject test platform
| Rank | Method | RF | Accuracy |
|---|---|---|---|
| 1 | 3 M-S64-(K)Mer | 0.00 | 100 |
| 1 | AFKS–canberra | 0.00 | 100 |
| 1 | AFKS–chi_squared | 0.00 | 100 |
| 1 | AFKS–d2_star | 0.00 | 100 |
| 1 | AFKS–d2s | 0.00 | 100 |
Here, we present the top 5 of 80 methods. Bold and the (*) sign mark the performance of our method
Fig. 4 Phylogenetic tree of 8 Yersinia genome sequences described in Table 2, using our proposed method
Fig. 5 Phylogenetic tree of 16 S Ribosomal DNA sequences of 13 bacteria, using our proposed method
DNA similarity identification accuracy comparison for 16 S Ribosomal DNA dataset
| Method | Param 1 | Param 2 | Accuracy |
|---|---|---|---|
| Proposed method | | | |
| Delibaş et al. [ | | | 91 |
Here, Column 1 lists the methods, Columns 2 and 3 list the two most important parameters, and the last column gives the performance achieved by each method. Bold and the (*) sign indicate the best result
Fig. 6 Phylogenetic tree of 18 Eutherian mammals, using our proposed method
DNA similarity identification accuracy comparison for 18 Eutherian mammals mitochondrial DNA dataset
| Method | Param 1 | Param 2 | Accuracy |
|---|---|---|---|
| Proposed method | | | |
| Delibaş et al. [ | | | 81 |
Here, Column 1 lists the methods, Columns 2 and 3 list the two most important parameters in each method, and the last column gives the performance achieved by each method. Bold and the (*) sign indicate the best result
Step-wise time complexity calculation for our proposed method
| Step | Method | Time complexity |
|---|---|---|
| Step 1 | Dynamic | |
| Step 2 | 2D | |
| Step 3 | Matrix shrinking | |
| Step 4 | 1D feature descriptor | |
| Step 5 | Distance and phylogenetic tree | |
| Final complexity | | |
Time complexity and memory space consumption comparison with existing works
| Dataset | Method | Time in seconds | Memory in MB |
|---|---|---|---|
| 16 S Ribosomal | Our proposed | | 9.0742 |
| 16 S Ribosomal | Delibaş et al. [ | 0.1461 | – |
| 18 Eutherian | Our proposed | | 16.0156 |
| 18 Eutherian | Delibaş et al. [ | 16.2565 | – |
| HIV-1 | Our proposed | | |
| HIV-1 | Ni et al. [ | 3600.00 | 208.00 |
| HEV | Our proposed | | |
| HEV | Ni et al. [ | 7200.00 | 205.00 |
| Fish | Our proposed | 1.032676 | 16.0469 |
| Yersinia | Our proposed | 81.321820 | 75.0531 |
Column 1 gives the name of the dataset, Column 2 the method applied to the dataset, Column 3 the time consumption, and Column 4 the memory consumption. Here, bold and (*) indicate the comparatively best result
Impact analysis of proposed shrinking algorithm in terms of time complexity
| Dataset | k-mer value | Required time without shrinking | Required time with shrinking, Sr=4 | Required time with shrinking, Sr=16 | Required time with shrinking, Sr=64 | Required time with shrinking, Sr=256 |
|---|---|---|---|---|---|---|
| 16 S Ribosomal | 8 | 0.145714 | 0.092079 | 0.007845 | 0.002434 | 0.000945 |
| 18 Eutherian | 8 | 1.765345 | 0.741836 | 0.069044 | 0.021210 | 0.008604 |
| HIV-1 | 8 | 1.284576 | 0.541657 | 0.012567 | 0.009631 | 0.003608 |
| HEV | 8 | 11.65471 | 1.101700 | 0.123601 | 0.087569 | 0.014063 |
| Fish | 8 | 10.63547 | 1.032676 | 0.102035 | 0.060645 | 0.010249 |
| Yersinia | 9 | 963.6387 | 81.32182 | 9.320472 | 4.403627 | 0.990544 |
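The Sr values in this table (4, 16, 64, 256) are all perfect squares, which is consistent with merging square blocks of neighboring cells. The sketch below is one possible reading, assuming that shrinking with Sr sums each block of √Sr × √Sr cells into a single cell. The paper's actual neighbor-analysis rule is not given in this record, so the function `shrink` and its pooling rule are hypothetical.

```python
import math


def shrink(matrix, sr):
    """Merge each sqrt(sr) x sqrt(sr) block of a square matrix into one cell.

    Assumption: block sums stand in for the paper's neighbor analysis.
    """
    side = math.isqrt(sr)
    if side * side != sr:
        raise ValueError("sr must be a perfect square")
    n = len(matrix)
    if n % side != 0:
        raise ValueError("matrix size must be divisible by sqrt(sr)")
    out_n = n // side
    out = [[0] * out_n for _ in range(out_n)]
    for r in range(n):
        for c in range(n):
            out[r // side][c // side] += matrix[r][c]
    return out
```

Under this reading, a 2^k × 2^k count matrix shrinks from 4^k cells to 4^k / Sr cells, so every later step that scans the feature vector handles Sr times fewer values. That would be in line with the falling run times across the columns, although the table itself does not state the cause.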