Rui Dong1, Lily He1, Rong Lucy He2, Stephen S-T Yau1.
Abstract
Classification of DNA sequences is an important problem in bioinformatics, yet most existing methods for phylogenetic analysis, including Multiple Sequence Alignment (MSA), are time-consuming and computationally expensive. Alignment-free methods are now popular, but the manual intervention they require usually decreases accuracy. In addition, most methods neglect the interactions among nucleotides. Here we propose a new Accumulated Natural Vector (ANV) method, which represents each DNA sequence by a point in ℝ¹⁸. By calculating the Accumulated Indicator Functions of the nucleotides, we obtain an Accumulated Natural Vector for each sequence. This vector not only captures the distribution of each nucleotide but also provides the covariance among nucleotides. Thus global comparison of DNA sequences or genomes can be done easily in ℝ¹⁸. Tests of ANV on datasets of different sizes and types demonstrate the accuracy and time-efficiency of the proposed method.
Keywords: accumulated natural vector; alignment-free; genomes; inter-nucleotide covariance; phylogenetic analysis
Year: 2019 PMID: 31024610 PMCID: PMC6465635 DOI: 10.3389/fgene.2019.00234
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
The Indicator Functions of the sequence “ATCTAGCT”.
| Position(i) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| A | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| C | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| G | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| T | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
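
As a minimal sketch of how such a table can be produced, the snippet below marks, for each nucleotide, the positions at which it occurs in the sequence. The function name `indicator_functions` and the A, C, G, T ordering are illustrative assumptions, not taken from the authors' implementation.

```python
NUCLEOTIDES = "ACGT"  # assumed ordering, for illustration only


def indicator_functions(seq):
    """Map each nucleotide to a 0/1 list: 1 where it occurs in seq, else 0."""
    seq = seq.upper()
    return {k: [1 if base == k else 0 for base in seq] for k in NUCLEOTIDES}


table = indicator_functions("ATCTAGCT")
for k in NUCLEOTIDES:
    print(k, table[k])
# A [1, 0, 0, 0, 1, 0, 0, 0]
# C [0, 0, 1, 0, 0, 0, 1, 0]
# G [0, 0, 0, 0, 0, 1, 0, 0]
# T [0, 1, 0, 1, 0, 0, 0, 1]
```

At every position exactly one of the four lists holds a 1, so the column sums are all 1.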
The Accumulated Indicator Functions of the sequence “ATCTAGCT”.
| Position(i) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| A | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 2 |
| C | 0 | 0 | 1 | 1 | 1 | 1 | 2 | 2 |
| G | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| T | 0 | 1 | 1 | 2 | 2 | 2 | 2 | 3 |
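
The accumulated values above are running totals of the 0/1 indicator values, so they can be sketched with a cumulative sum. The helper name `accumulated_indicator_functions` is hypothetical, and the snippet covers only this accumulation step, not the full 18-dimensional ANV.

```python
from itertools import accumulate

NUCLEOTIDES = "ACGT"  # assumed ordering, for illustration only


def accumulated_indicator_functions(seq):
    """Map each nucleotide to the running count of its occurrences in seq."""
    seq = seq.upper()
    return {
        k: list(accumulate(1 if base == k else 0 for base in seq))
        for k in NUCLEOTIDES
    }


acc = accumulated_indicator_functions("ATCTAGCT")
for k in NUCLEOTIDES:
    print(k, acc[k])
# A [1, 1, 1, 1, 2, 2, 2, 2]
# C [0, 0, 1, 1, 1, 1, 2, 2]
# G [0, 0, 0, 0, 0, 1, 1, 1]
# T [0, 1, 1, 2, 2, 2, 2, 3]
```

The last entry of each list is the total count of that nucleotide, and at position i the four values sum to i.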
Figure 1. The phylogenetic UPGMA tree using the ANV method on the Coronaviruses dataset.
Figure 2. (A) The phylogenetic UPGMA tree using the FFP (k-mer) method on the Coronaviruses dataset. (B) The phylogenetic UPGMA tree using the MSA (ClustalW) method on the Coronaviruses dataset.
Figure 3. The phylogenetic UPGMA tree using the ANV method on the Influenza A viruses dataset.
Figure 4. (A) The phylogenetic UPGMA tree using the k-mer (k = 5) method on the Influenza A viruses dataset. (B) The phylogenetic UPGMA tree using the MSA (ClustalW) method on the Influenza A viruses dataset.
Figure 5. The Natural Graph using the ANV method on the Influenza A viruses dataset.
Figure 6. (A) The phylogenetic UPGMA tree using the ANV method on the Ebolaviruses dataset. (B) The phylogenetic UPGMA tree using the traditional Natural Vector (NV) method on the Ebolaviruses dataset.
Comparison of the ANV and k-mer methods on the 351-virus dataset.
| Family | 94.87% | 71.23% | 25.36% | 16.24% | 72.08% |
| Genus | 83.19% | 65.24% | 21.65% | 12.25% | 65.53% |
| Computing Time (seconds) | 2466.73 | 4179.24 | 8636.13 | 24011.70 | Unable to compute on laptop |
Figure 7. (A) The phylogenetic UPGMA tree using the ANV method on the mammals mtDNA dataset. (B) The phylogenetic UPGMA tree using the FFP (k-mer) method with K = 8 on the mammals mtDNA dataset.
Description of the DNA sequence mutations in the simulated tests.
| A_original | 200 point mutations from the randomly generated sequence with length 1,000 bp |
| A1 | 2 random nucleotide substitutions in A_original |
| A2 | 2 random nucleotide substitutions in A_original |
| A3 | 5 random nucleotide substitutions in A_original |
| A4 | 5 random nucleotide substitutions in A_original |
| A5 | 10 random nucleotide substitutions in A_original |
| A6 | 10 random nucleotide substitutions in A_original |
| B_original | 200 point mutations from the randomly generated sequence with length 1,000 bp (different from A_original) |
| B1 | 2 random nucleotide substitutions in B_original |
| B2 | 2 random nucleotide substitutions in B_original |
| B3 | 5 random nucleotide substitutions in B_original |
| B4 | 5 random nucleotide substitutions in B_original |
| B5 | 10 random nucleotide substitutions in B_original |
| B6 | 10 random nucleotide substitutions in B_original |
| B7 | 10 bp deletion from positions 51:60 in B_original |
| B8 | 10 bp deletion from positions 601:610 in B_original |
| B9 | 20 bp insertion at position 51 in B_original |
| B10 | 20 bp insertion at position 601 in B_original |
| B11 | 50 bp transposition from position 1 to 50 in B_original |
| B12 | 100 bp transposition from position 601 to 700 in B_original |
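
The substitution, deletion, and insertion rows above could be generated along the following lines. This is a sketch under stated assumptions, not the authors' procedure: positions are treated as 1-based, "point mutations" are modelled as substitutions, inserted fragments are random, and all function names are hypothetical. Transpositions are left out because the table does not say where the moved segment is reinserted.

```python
import random

NUCLEOTIDES = "ACGT"


def random_sequence(length, rng):
    """Uniformly random DNA sequence of the given length."""
    return "".join(rng.choice(NUCLEOTIDES) for _ in range(length))


def substitute(seq, n, rng):
    """Change n distinct, randomly chosen positions to a different nucleotide."""
    bases = list(seq)
    for pos in rng.sample(range(len(bases)), n):
        bases[pos] = rng.choice([b for b in NUCLEOTIDES if b != bases[pos]])
    return "".join(bases)


def delete_segment(seq, start, length):
    """Delete `length` bases beginning at 1-based position `start` (assumed convention)."""
    return seq[:start - 1] + seq[start - 1 + length:]


def insert_segment(seq, start, fragment):
    """Insert `fragment` so its first base lands at 1-based position `start` (assumed)."""
    return seq[:start - 1] + fragment + seq[start - 1:]


rng = random.Random(0)  # fixed seed, only to make the sketch reproducible
base = random_sequence(1000, rng)
b_original = substitute(base, 200, rng)   # 200 point mutations, modelled as substitutions
b1 = substitute(b_original, 2, rng)       # like row B1
b7 = delete_segment(b_original, 51, 10)   # like row B7
b9 = insert_segment(b_original, 51, random_sequence(20, rng))  # like row B9
print(len(b_original), len(b7), len(b9))  # 1000 990 1020
```

Sampling distinct positions and forcing a different nucleotide at each one guarantees that the mutated sequence differs from its parent at exactly `n` sites.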
Figure 8. The phylogenetic UPGMA tree using the Jukes-Cantor pairwise alignment method on the simulated dataset.
Figure 9. (A) The phylogenetic UPGMA tree using the ANV method on the simulated dataset. (B) The phylogenetic UPGMA tree using the FFP (k-mer) method with K = 4 on the simulated dataset.
Robinson-Foulds distances between the trees built by alignment-free methods and the reliable alignment tree.
| distance | 0 | 23 | 27 | 29 | 29 |