Lily He, Rui Dong, Rong Lucy He, Stephen S-T Yau.
Abstract
Advances in sequencing technology have made large amounts of biological data available. Evolutionary analysis of data such as DNA sequences is highly important in biological studies. Because alignment methods are too costly for large-scale data, alignment-free methods have recently attracted attention in bioinformatics. In this paper, we introduce a new positional correlation natural vector (PCNV) method that converts a DNA sequence into an 18-dimensional numerical feature vector. The PCNV of a DNA sequence represents its nucleotide distribution using frequency and position correlation. This numerical vector design uses six features to characterize the correlation among nucleotide positions in a sequence. PCNV is also easy to compute and can be used for rapid genome comparison. To test the method, we performed phylogenetic analysis of several viral and bacterial genome datasets with PCNV. For comparison, an alignment-based method, Bayesian inference, and two alignment-free methods, feature frequency profile (FFP) and natural vector (NV), were applied to the same datasets. We found that the PCNV technique is fast and accurate when used for phylogenetic analysis and classification of viruses and bacteria.
Keywords: alignment-free; genome comparison; phylogenetic analysis; positional correlation natural vector
Year: 2020 PMID: 32485813 PMCID: PMC7312176 DOI: 10.3390/ijms21113859
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1. (a) The Neighbor-Joining phylogenetic tree of 82 HCV genome sequences based on the PCNV method. (b) The Neighbor-Joining phylogenetic tree of 82 HCV genome sequences based on the FFP method (k = 6). (c) The phylogenetic tree of 82 HCV genome sequences based on the Bayesian inference method.
Figure 2. (a) The Neighbor-Joining phylogenetic tree of 152 HBV genome sequences based on the PCNV method. (b) The Neighbor-Joining phylogenetic tree of 152 HBV genome sequences based on the FFP method (k = 5). (c) The Neighbor-Joining phylogenetic tree of 152 HBV genome sequences based on the NV method. (d) The phylogenetic tree of 152 HBV genome sequences based on the Bayesian inference method.
Figure 3. (a) The Neighbor-Joining phylogenetic tree of 330 dengue virus genome sequences based on the PCNV method. (b) The Neighbor-Joining phylogenetic tree of 330 dengue virus genome sequences based on the NV method.
Figure 4. (a) The Neighbor-Joining phylogenetic tree of 326 HPV genome sequences based on the PCNV method. (b) The phylogenetic tree of 326 HPV genome sequences based on the Bayesian inference method.
Running times of the PCNV, Bayesian inference, FFP, AFKS, and Muscle methods. The number of genomes in each dataset is given in parentheses. "∼" indicates that the method could not be computed on a laptop.
| Method | HCV (82) | HBV (152) | Dengue (330) | HPV (326) | Bacteria (59) |
|---|---|---|---|---|---|
| PCNV | 0.33 s | 0.27 s | 0.66 s | 0.78 s | 53.71 s |
| Bayesian inference | 1097 s | 263 s | 217,353 s | 217,512 s | ∼ |
| FFP | 11.11 s (k = 6) | 0.38 s (k = 5) | 49.40 s (k = 6) | 35.00 s (k = 6) | more than 1 day (k = 11) |
| AFKS | 70.21 s (k = 5) | 29.62 s (k = 4) | 429.87 s (k = 5) | 413.79 s (k = 5) | more than 4 days (k = 9) |
| Muscle | 753 s | 155 s | 3740 s | 4002 s | ∼ |
Figure 5. (a) The Neighbor-Joining phylogenetic tree of 59 bacterial genome sequences based on the PCNV method. (b) The Neighbor-Joining phylogenetic tree of 59 bacterial genome sequences based on the FFP method (k = 11).
Sensitivity (Sens), specificity (Spec), and accuracy (Acc) of classification for the four virus datasets. For each dataset, the Ave. row gives the average of each measure.
| Dataset | Type | Number | Sens PCNV (%) | Sens FFP (%) | Sens AFKS (%) | Spec PCNV (%) | Spec FFP (%) | Spec AFKS (%) | Acc PCNV (%) | Acc FFP (%) | Acc AFKS (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| HCV | type1 | 16 | 100 | 62.5 | 50.0 | 100 | 87.9 | 86.4 | 100 | 62.5 | 50.0 |
| | type2 | 18 | 100 | 55.6 | 94.4 | 100 | 93.8 | 98.4 | 100 | 55.6 | 94.4 |
| | type3 | 20 | 100 | 80.0 | 90.0 | 100 | 93.5 | 96.8 | 100 | 80.0 | 90.0 |
| | type4 | 12 | 100 | 50.0 | 33.3 | 100 | 97.1 | 90.0 | 100 | 50.0 | 33.3 |
| | type5 | 4 | 100 | 50.0 | 75.0 | 100 | 96.2 | 97.4 | 100 | 50.0 | 75.0 |
| | type6 | 12 | 100 | 50.0 | 83.3 | 100 | 91.4 | 98.6 | 100 | 50.0 | 83.3 |
| | Ave. | | 100 | 58.0 | 71.0 | 100 | 93.3 | 94.6 | 100 | 58.0 | 71.0 |
| HBV | A | 20 | 100 | 100 | 100 | 100 | 100 | 99.2 | 100 | 100 | 100 |
| | B | 15 | 100 | 100 | 40.0 | 100 | 100 | 96.4 | 100 | 100 | 40.0 |
| | C | 20 | 100 | 100 | 70.0 | 100 | 100 | 96.2 | 100 | 100 | 70.0 |
| | D | 13 | 100 | 100 | 76.9 | 100 | 100 | 97.1 | 100 | 100 | 76.9 |
| | E | 30 | 100 | 100 | 90.0 | 100 | 100 | 97.5 | 100 | 100 | 90.0 |
| | F | 22 | 100 | 100 | 72.7 | 100 | 100 | 93.8 | 100 | 100 | 72.7 |
| | G | 17 | 100 | 100 | 94.1 | 100 | 100 | 99.3 | 100 | 100 | 94.1 |
| | H | 15 | 100 | 100 | 80.0 | 100 | 100 | 97.1 | 100 | 100 | 80.0 |
| | Ave. | | 100 | 100 | 78.0 | 100 | 100 | 97.1 | 100 | 100 | 78.0 |
| Dengue | type1 | 72 | 100 | 100 | 76.4 | 100 | 100 | 93.4 | 100 | 100 | 76.4 |
| | type2 | 75 | 100 | 100 | 73.3 | 100 | 100 | 93.3 | 100 | 100 | 73.3 |
| | type3 | 83 | 100 | 100 | 78.3 | 100 | 100 | 92.7 | 100 | 100 | 78.3 |
| | type4 | 100 | 100 | 100 | 87.0 | 100 | 100 | 93.0 | 100 | 100 | 87.0 |
| | Ave. | | 100 | 100 | 78.8 | 100 | 100 | 93.1 | 100 | 100 | 78.8 |
| HPV | 6 | 24 | 100 | 100 | 75.0 | 100 | 100 | 97.7 | 100 | 100 | 75.0 |
| | 11 | 17 | 100 | 100 | 100 | 100 | 100 | 99.7 | 100 | 100 | 100 |
| | 16 | 99 | 100 | 100 | 92.9 | 100 | 100 | 96.5 | 100 | 100 | 92.9 |
| | 18 | 19 | 100 | 100 | 94.7 | 100 | 100 | 100 | 100 | 100 | 94.7 |
| | 31 | 23 | 100 | 100 | 82.6 | 100 | 100 | 99.0 | 100 | 100 | 82.6 |
| | 33 | 22 | 100 | 100 | 86.4 | 100 | 100 | 99.7 | 100 | 100 | 86.4 |
| | 35 | 26 | 100 | 100 | 88.5 | 100 | 100 | 99.0 | 100 | 100 | 88.5 |
| | 45 | 12 | 100 | 100 | 83.3 | 100 | 100 | 99.7 | 100 | 100 | 83.3 |
| | 52 | 22 | 100 | 100 | 81.8 | 100 | 100 | 98.4 | 100 | 100 | 81.8 |
| | 53 | 14 | 100 | 100 | 85.7 | 100 | 100 | 98.7 | 100 | 100 | 85.7 |
| | 58 | 37 | 100 | 100 | 94.6 | 100 | 100 | 99.7 | 100 | 100 | 94.6 |
| | 66 | 11 | 100 | 100 | 90.9 | 100 | 100 | 99.7 | 100 | 100 | 90.9 |
| | Ave. | | 100 | 100 | 88.0 | 100 | 100 | 99.0 | 100 | 100 | 88.0 |
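The measures in this table are not defined in this record. As a minimal sketch, assuming the usual one-vs-rest definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)), per-type values could be computed as below. The function name `one_vs_rest_rates` and the toy labels are hypothetical and are not taken from the paper; how the paper computes Acc is not stated here, so it is left out.

```python
def one_vs_rest_rates(true_labels, predicted_labels, target):
    """Return (sensitivity %, specificity %) for one class versus all others.

    Assumes the standard definitions; `target` must occur in `true_labels`
    and at least one other class must occur too, otherwise a division by
    zero is raised.
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == target:
            if pred == target:
                tp += 1  # target genome assigned to the target type
            else:
                fn += 1  # target genome assigned elsewhere
        else:
            if pred == target:
                fp += 1  # other genome wrongly assigned to the target type
            else:
                tn += 1  # other genome correctly kept out of the target type
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    return sensitivity, specificity


# Toy data (hypothetical): 4 genomes of type "A" and 6 of type "B".
true_labels = ["A"] * 4 + ["B"] * 6
predicted_labels = ["A", "A", "A", "B"] + ["B"] * 5 + ["A"]

sens_a, spec_a = one_vs_rest_rates(true_labels, predicted_labels, "A")
print(round(sens_a, 1), round(spec_a, 1))
```

Under these definitions, a sensitivity of 62.5% for a type with 16 genomes corresponds to 10 of the 16 being assigned to the correct type.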
Figure 6. The Neighbor-Joining phylogenetic tree of eight Yersinia genomes based on the PCNV method.
Summary of the HCV, HBV, Dengue, HPV, and Bacteria datasets. The length distribution of each dataset shows that PCNV can work with long sequences.
| Dataset | Number | Min (bp) | Median (bp) | Mean (bp) | Max (bp) |
|---|---|---|---|---|---|
| HCV | 82 | 8,957 | 9,442 | 9,427 | 9,666 |
| HBV | 152 | 10,161 | 10,669 | 10,606 | 10,780 |
| Dengue | 330 | 10,161 | 10,669 | 10,606 | 10,780 |
| HPV | 326 | 7,814 | 7,905 | 7,895 | 8,051 |
| Bacteria | 59 | 846,214 | 4,016,947 | 3,610,938 | 5,966,919 |
The positional distribution of “ACTGGCAAT”.
| Sequence | A | C | T | G | G | C | A | A | T |
|---|---|---|---|---|---|---|---|---|---|
| Position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
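The positional table above can be turned into per-nucleotide position lists, from which counts (frequency) and mean positions follow directly. The sketch below is an illustration only: the function names `nucleotide_positions` and `count_and_mean_position` are hypothetical, and the six correlation features of PCNV are not reproduced because their formulas do not appear in this record.

```python
from collections import defaultdict


def nucleotide_positions(seq):
    """Map each nucleotide to the 1-based positions where it occurs."""
    positions = defaultdict(list)
    for index, base in enumerate(seq, start=1):
        positions[base].append(index)
    return dict(positions)


def count_and_mean_position(seq):
    """Return {base: (count, mean position)} for the bases present in seq."""
    return {
        base: (len(pos), sum(pos) / len(pos))
        for base, pos in nucleotide_positions(seq).items()
    }


positions = nucleotide_positions("ACTGGCAAT")
summary = count_and_mean_position("ACTGGCAAT")

for base in "ACGT":
    print(base, positions[base], summary[base])
```

For "ACTGGCAAT", A occurs at positions 1, 7, and 8, C at 2 and 6, G at 4 and 5, and T at 3 and 9, matching the table.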