Gerardo Mendizabal-Ruiz, Israel Román-Godínez, Sulema Torres-Ramos, Ricardo A Salido-Ruiz, J Alejandro Morales.
Abstract
Genomic signal processing (GSP) refers to the use of signal processing for the analysis of genomic data. GSP methods require the transformation or mapping of the genomic data to a numeric representation. To date, several DNA numeric representations (DNR) have been proposed; however, it is not clear what the properties of each DNR are or how the choice of one affects the results when a signal processing technique is used to analyze them. In this paper, we present an experimental study of the characteristics of nine of the most frequently used DNR. The objective of this paper is to evaluate the behavior of each representation when used to measure the similarity of a given pair of DNA sequences.
Year: 2017 PMID: 28323839 PMCID: PMC5360225 DOI: 10.1371/journal.pone.0173288
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Selected DNA numerical representations.
| # | Name | Numeric representation | Example for sequence X = [ |
|---|---|---|---|
| 1 | Integer | | |
| 2 | Real | | |
| 3 | EIIP | | |
| 4 | Atomic Number | | |
| 5 | Paired Numeric | | |
| 6 | Voss | | |
| 7 | Tetrahedron | | |
| 8 | Z-Curve | | |
| 9 | DNA walk | | |
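The numeric columns of the table did not survive, so as an illustration only, the sketch below encodes a sequence with three of the listed representations using their commonly cited definitions. The EIIP values, the sign convention of the DNA walk, the function names, and the example sequence are all assumptions drawn from the general GSP literature, not values recovered from this table.

```python
import numpy as np

# Assumed EIIP values (commonly cited in the GSP literature; not recovered from the table).
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}

def single_value_map(seq, table):
    """Map each nucleotide to one number using a lookup table."""
    return np.array([table[base] for base in seq.upper()], dtype=float)

def voss(seq):
    """Four binary indicator sequences, one row per nucleotide (A, C, G, T)."""
    seq = seq.upper()
    return np.array([[1.0 if base == nt else 0.0 for base in seq] for nt in "ACGT"])

def dna_walk(seq):
    """Cumulative walk: +1 for a pyrimidine (C/T), -1 for a purine (A/G).

    The sign convention is an assumption.
    """
    steps = np.array([1.0 if base in "CT" else -1.0 for base in seq.upper()])
    return np.cumsum(steps)

example = "ACGTTA"  # hypothetical sequence, not the paper's X
print(single_value_map(example, EIIP))
print(voss(example))
print(dna_walk(example))
```

Single-value maps such as Integer, Real, Atomic Number, and Paired Numeric would fit the same `single_value_map` helper with a different lookup table.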
Fig 1. Biological species selected for gene RP-S18 similarity comparison.
Fig 2. Biological species selected for gene COX1 similarity comparison.
Fig 3. Mean Euclidean distance scores for 400 synthetic sequences in each one of the 42 datasets when using each of the selected DNR (i: insertion, d: deletion, s: substitution, i-d-s: insertion-deletion-substitution, i-d: insertion-deletion, i-s: insertion-substitution, d-s: deletion-substitution).
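The captions refer to synthetic sequences that differ from a reference by a given percentage of insertions, deletions, and substitutions. A minimal sketch of how such variants could be generated is shown below. The function name `mutate`, the uniform choice of edit type and position, and the seeding are assumptions for illustration, not the authors' procedure.

```python
import random

ALPHABET = "ACGT"

def mutate(seq, percent, kinds=("i", "d", "s"), seed=None):
    """Apply round(len(seq) * percent / 100) random edits to `seq`.

    kinds: any subset of "i" (insertion), "d" (deletion), "s" (substitution).
    Each edit picks its kind and position uniformly at random (an assumption).
    """
    rng = random.Random(seed)
    bases = list(seq)
    n_edits = round(len(seq) * percent / 100)
    for _ in range(n_edits):
        kind = rng.choice(kinds)
        if kind == "i":
            bases.insert(rng.randrange(len(bases) + 1), rng.choice(ALPHABET))
        elif kind == "d" and bases:
            del bases[rng.randrange(len(bases))]
        elif kind == "s" and bases:
            pos = rng.randrange(len(bases))
            # Substitute with a different base so the edit is never a no-op.
            bases[pos] = rng.choice([b for b in ALPHABET if b != bases[pos]])
    return "".join(bases)

original = "ACGT" * 25  # hypothetical 100-nt reference sequence
variant = mutate(original, percent=8, kinds=("i", "d", "s"), seed=0)
print(len(original), len(variant))
```

Passing `kinds=("i", "d")` or `kinds=("s",)` restricts the edits to the combined or single change types named in the caption.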
Angle (in degrees) of the rate of change in the mean Euclidean distance scores for the type of change corresponding to insertions-deletions-substitutions, for five ranges of the percentage of changes (2-1, 4-2, 8-4, 16-8, 32-16).
| DNR | Angle 2-1 | Angle 4-2 | Angle 8-4 | Angle 16-8 | Angle 32-16 | Score at 1% | Score at 2% | Score at 4% | Score at 8% | Score at 16% | Score at 32% |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Integer | 88.8 | 86.9 | 80.2 | 69.3 | 42.7 | 601.37 | 648.64 | 685.70 | 708.74 | 729.91 | 744.66 |
| Real | 88.7 | 86.5 | 78.5 | 65.0 | 37.2 | 583.68 | 626.69 | 659.76 | 679.33 | 696.50 | 708.65 |
| EIIP | 45.6 | 25.0 | 10.1 | 5.9 | 2.6 | 12.22 | 13.24 | 14.18 | 14.89 | 15.71 | 16.45 |
| Atomic Number | 89.9 | 89.7 | 89.3 | 89.1 | 88.0 | 4367.13 | 4812.52 | 5258.03 | 5605.61 | 6105.16 | 6568.63 |
| Paired Numeric | 88.5 | 86.5 | 77.8 | 63.8 | 34.6 | 525.94 | 565.33 | 597.74 | 616.22 | 632.44 | 643.48 |
| DNA Walk | 89.9 | 89.8 | 89.7 | 89.6 | 89.5 | 1309.21 | 1772.64 | 2451.74 | 3360.19 | 4453.91 | 6306.96 |
| Voss | 86.6 | 81.7 | 64.1 | 42.5 | 18.4 | 228.48 | 245.30 | 259.07 | 267.32 | 274.65 | 279.95 |
| Tetrahedron | 87.4 | 83.7 | 68.8 | 51.0 | 23.0 | 304.65 | 326.77 | 344.82 | 355.11 | 364.98 | 371.77 |
| Z-Curve | 89.9 | 89.8 | 89.8 | 89.7 | 89.5 | 1500.43 | 2066.82 | 2826.21 | 3836.01 | 5351.83 | 7053.40 |
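The angle columns are consistent with reading each entry as the arctangent, in degrees, of the slope between two consecutive scores, with the percentage of change on the horizontal axis. That reading is inferred from the tabulated values rather than stated in the caption, and the function name below is hypothetical. The sketch recomputes the angles for the Integer row.

```python
import math

def rate_of_change_angles(percents, scores):
    """Angle (degrees) of the slope between consecutive (percent, score) points."""
    angles = []
    for i in range(len(percents) - 1):
        slope = (scores[i + 1] - scores[i]) / (percents[i + 1] - percents[i])
        angles.append(math.degrees(math.atan(slope)))
    return angles

percents = [1, 2, 4, 8, 16, 32]
# Mean Euclidean distance scores for the Integer representation, copied from the table.
integer_scores = [601.37, 648.64, 685.70, 708.74, 729.91, 744.66]

print([round(a, 1) for a in rate_of_change_angles(percents, integer_scores)])
```

Rounded to one decimal, the result agrees with the Integer row above (88.8, 86.9, 80.2, 69.3, 42.7). Angles near 90 degrees therefore mostly reflect the large absolute scale of a score, not a qualitatively different response to change.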
Fig 4. Mean normalized squared Euclidean distance scores for 400 synthetic sequences in each of the 42 datasets when using each of the selected DNR (i: insertion, d: deletion, s: substitution, i-d-s: insertion-deletion-substitution, i-d: insertion-deletion, i-s: insertion-substitution, d-s: deletion-substitution).
Angle (in degrees) of the rate of change in the mean normalized squared Euclidean distance scores for the type of change corresponding to insertions-deletions-substitutions, for five ranges of the percentage of changes (2-1, 4-2, 8-4, 16-8, 32-16).
| DNR | Angle 2-1 | Angle 4-2 | Angle 8-4 | Angle 16-8 | Angle 32-16 | Score at 1% | Score at 2% | Score at 4% | Score at 8% | Score at 16% | Score at 32% |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Integer | 0.6 | 0.3 | 8.8×10⁻² | 4.1×10⁻² | 1.5×10⁻² | 0.07028 | 0.08121 | 0.09058 | 0.09671 | 0.10250 | 0.10663 |
| Real | 2.7 | 1.1 | 0.3 | 0.2 | 5.4×10⁻² | 0.33235 | 0.37968 | 0.41860 | 0.44303 | 0.46433 | 0.47933 |
| EIIP | 5.0×10⁻² | 2.5×10⁻² | 1.1×10⁻² | 6.4×10⁻³ | 3.0×10⁻³ | 0.00530 | 0.00618 | 0.00707 | 0.00780 | 0.00870 | 0.00954 |
| Atomic Number | 2.5×10⁻² | 1.4×10⁻² | 6.1×10⁻³ | 4.7×10⁻³ | 2.3×10⁻³ | 0.00208 | 0.00251 | 0.00300 | 0.00342 | 0.00408 | 0.00473 |
| Paired Numeric | 2.7 | 1.2 | 0.4 | 0.2 | 5.9×10⁻² | 0.32230 | 0.36861 | 0.41081 | 0.43616 | 0.46088 | 0.47739 |
| DNA Walk | 0.1 | 9.8×10⁻² | 9.9×10⁻² | 7.4×10⁻² | 7.7×10⁻² | 0.00210 | 0.00390 | 0.00734 | 0.01422 | 0.02453 | 0.04605 |
| Voss | 1.6 | 0.7 | 0.2 | 0.1 | 3.8×10⁻² | 0.19917 | 0.22771 | 0.25325 | 0.26935 | 0.28420 | 0.29469 |
| Tetrahedron | 2.6 | 1.2 | 0.3 | 0.2 | 5.9×10⁻² | 0.32567 | 0.37133 | 0.41189 | 0.43632 | 0.46108 | 0.47756 |
| Z-Curve | 0.2 | 0.2 | 0.2 | 0.2 | 9.8×10⁻² | 0.00506 | 0.00917 | 0.01684 | 0.02940 | 0.05263 | 0.07996 |
Fig 5. Mean Manhattan distance scores for 400 synthetic sequences in each one of the 42 datasets when using each of the selected DNR (i: insertion, d: deletion, s: substitution, i-d-s: insertion-deletion-substitution, i-d: insertion-deletion, i-s: insertion-substitution, d-s: deletion-substitution).
Angle (in degrees) of the rate of change in the mean Manhattan distance scores for the type of change corresponding to insertions-deletions-substitutions, for five ranges of the percentage of changes (2-1, 4-2, 8-4, 16-8, 32-16).
| DNR | Angle 2-1 | Angle 4-2 | Angle 8-4 | Angle 16-8 | Angle 32-16 | Score at 1% | Score at 2% | Score at 4% | Score at 8% | Score at 16% | Score at 32% |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Integer | 90.0 | 89.9 | 89.7 | 89.3 | 88.0 | 14545.7 | 15922.5 | 17065.4 | 17796.3 | 18440.1 | 18891.6 |
| Real | 90.0 | 89.9 | 89.6 | 89.2 | 87.5 | 14127.3 | 15379.3 | 16405.9 | 17043.9 | 17589.7 | 17949.1 |
| EIIP | 88.1 | 85.7 | 77.5 | 66.1 | 40.2 | 302.3 | 331.9 | 358.3 | 376.4 | 394.5 | 408.0 |
| Atomic Number | 90.0 | 90.0 | 90.0 | 89.9 | 89.9 | 109659.5 | 121566.4 | 132437.8 | 139022.9 | 147648.2 | 154013.4 |
| Paired Numeric | 89.9 | 89.9 | 89.6 | 89.1 | 87.3 | 12759.7 | 13901.9 | 14878.3 | 15468.6 | 15958.7 | 16297.6 |
| DNA Walk | 90.0 | 90.0 | 89.9 | 89.9 | 89.9 | 19099.4 | 23006.4 | 27651.7 | 32232.5 | 37815.1 | 45259.6 |
| Voss | 89.9 | 89.7 | 89.1 | 88.0 | 84.2 | 5532.5 | 6023.3 | 6443.3 | 6704.3 | 6939.2 | 7097.1 |
| Tetrahedron | 89.9 | 89.8 | 89.3 | 88.5 | 85.5 | 7375.7 | 8020.8 | 8575.5 | 8902.9 | 9217.5 | 9421.4 |
| Z-Curve | 90.0 | 90.0 | 90.0 | 89.9 | 89.9 | 18633.5 | 22596.3 | 27166.0 | 31891.5 | 38082.2 | 44335.7 |
Fig 6. Mean 1-Correlation scores for 400 synthetic sequences in each one of the 42 datasets when using each of the selected DNRs (i: insertion, d: deletion, s: substitution, i-d-s: insertion-deletion-substitution, i-d: insertion-deletion, i-s: insertion-substitution, d-s: deletion-substitution). Note that the range of each box is not [0, 1]; the ranges vary to give a better visualization.
Angle (in degrees) of the rate of change in the mean correlation coefficient scores for the type of change corresponding to insertions-deletions-substitutions, for five ranges of the percentage of changes (2-1, 4-2, 8-4, 16-8, 32-16).
| DNR | Angle 2-1 | Angle 4-2 | Angle 8-4 | Angle 16-8 | Angle 32-16 | Score at 1% | Score at 2% | Score at 4% | Score at 8% | Score at 16% | Score at 32% |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Integer | 1.3 | 0.5 | 0.2 | 8.3×10⁻² | 3.0×10⁻² | 0.14053 | 0.16240 | 0.18112 | 0.19337 | 0.20493 | 0.21317 |
| Real | 5.4 | 2.2 | 0.7 | 0.3 | 0.1 | 0.66464 | 0.75931 | 0.83715 | 0.88604 | 0.92864 | 0.95865 |
| EIIP | 0.1 | 5.1×10⁻² | 2.1×10⁻² | 1.3×10⁻² | 6.0×10⁻³ | 0.01060 | 0.01236 | 0.01413 | 0.01560 | 0.01739 | 0.01907 |
| Atomic Number | 5.0×10⁻² | 2.8×10⁻² | 1.2×10⁻² | 9.5×10⁻³ | 4.7×10⁻³ | 0.00416 | 0.00503 | 0.00599 | 0.00684 | 0.00816 | 0.00946 |
| Paired Numeric | 5.3 | 2.4 | 0.7 | 0.4 | 0.1 | 0.64453 | 0.73715 | 0.82156 | 0.87229 | 0.92173 | 0.95477 |
| DNA Walk | 8.9×10⁻² | 8.0×10⁻² | 9.7×10⁻² | 8.5×10⁻² | 8.5×10⁻² | 0.00186 | 0.00341 | 0.00621 | 0.01300 | 0.02488 | 0.04852 |
| Voss | 3.3 | 1.5 | 0.5 | 0.2 | 7.5×10⁻² | 0.39826 | 0.45532 | 0.50639 | 0.53857 | 0.56826 | 0.58921 |
| Tetrahedron | 5.2 | 2.3 | 0.7 | 0.4 | 0.1 | 0.65123 | 0.74256 | 0.82370 | 0.87258 | 0.92212 | 0.95510 |
| Z-Curve | 0.3 | 0.2 | 0.2 | 0.2 | 0.1 | 0.00691 | 0.01187 | 0.02045 | 0.03559 | 0.06183 | 0.09648 |
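The four similarity scores tabulated above can be sketched for two equal-length numeric signals as follows. The Euclidean distance, Manhattan distance, and 1 − correlation are the standard definitions. The normalized squared Euclidean formula used here (half the variance of the difference divided by the sum of the variances) is an assumption; it was chosen because the tabulated normalized scores are close to half of the corresponding 1-Correlation scores, which this definition reproduces when both signals have equal variance. How the authors align signals of different lengths is not covered by this sketch.

```python
import numpy as np

def euclidean(u, v):
    return float(np.sqrt(np.sum((u - v) ** 2)))

def manhattan(u, v):
    return float(np.sum(np.abs(u - v)))

def one_minus_correlation(u, v):
    # 1 minus the Pearson correlation coefficient.
    return float(1.0 - np.corrcoef(u, v)[0, 1])

def normalized_sq_euclidean(u, v):
    # Assumed definition: 0.5 * Var(u - v) / (Var(u) + Var(v)).
    return float(0.5 * np.var(u - v) / (np.var(u) + np.var(v)))

# Hypothetical equal-length signals standing in for two encoded sequences.
u = np.array([0.0, 1.0, 2.0, 3.0, 2.0, 1.0])
v = np.array([0.0, 1.0, 3.0, 3.0, 1.0, 1.0])

for fn in (euclidean, manhattan, one_minus_correlation, normalized_sq_euclidean):
    print(fn.__name__, fn(u, v))
```

Because the Euclidean and Manhattan distances scale with the magnitude of the encoded values while the other two scores do not, representations with large numeric values produce much larger scores under the first two.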
Fig 7. Mean variance of the frequency components according to the percentage of change for the selected DNRs using a color palette where red and blue represent high and low variances, respectively.
HF stands for high frequencies and LF for low frequencies.
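As a rough illustration of a per-frequency variance, the sketch below computes the variance of each spectral component across a set of equal-length numeric signals. It assumes that the frequency components are DFT magnitudes and that LF and HF can be approximated by the lower and upper halves of the spectrum; both are assumptions, as are the function name and the random test data.

```python
import numpy as np

def frequency_variance(signals):
    """Variance of each DFT magnitude component across a set of equal-length signals.

    signals: 2-D array with one numeric-encoded sequence per row.
    """
    spectra = np.abs(np.fft.rfft(signals, axis=1))  # magnitude spectrum of each row
    return spectra.var(axis=0)                      # variance across rows, per frequency bin

rng = np.random.default_rng(0)
# Hypothetical data: 20 random +/-1 signals of length 64 standing in for encoded sequences.
signals = rng.choice([-1.0, 1.0], size=(20, 64))
var_per_bin = frequency_variance(signals)

half = len(var_per_bin) // 2
print("mean LF variance:", var_per_bin[:half].mean())
print("mean HF variance:", var_per_bin[half:].mean())
```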
Complementary sequence scores for each DNR.
ED stands for Euclidean distance, NE for normalized Euclidean distance, MD for Manhattan distance, and CC for correlation coefficient.
| DNR | Complementary ED | Complementary NE | Complementary MD | Complementary CC | Reverse complementary ED | Reverse complementary NE | Reverse complementary MD | Reverse complementary CC |
|---|---|---|---|---|---|---|---|---|
| Integer | 636.92 | 0.075 | 1.63×10⁴ | 0.84 | 636.92 | 0.075 | 1.63×10⁴ | 0.84 |
| Real | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| EIIP | 15.68 | 0.008 | 391.46 | 0.98 | 15.68 | 0.008 | 391.46 | 0.98 |
| Atomic Number | 4.69×10³ | 0.002 | 1.02×10⁵ | 0.99 | 4.69×10³ | 0.002 | 1.02×10⁵ | 0.99 |
| Paired Numeric | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| Voss | 277.12 | 0.289 | 7.07×10³ | 0.42 | 277.12 | 0.289 | 7.07×10³ | 0.42 |
| Tetrahedron | 245.93 | 0.316 | 6.25×10³ | 0.36 | 245.93 | 0.316 | 6.25×10³ | 0.36 |
| Z-Curve | 0 | 0 | 0 | 1 | 8.98×10³ | 0.081 | 3.62×10⁴ | 0.8925 |
| DNA Walk | 0 | 0 | 0 | 1 | 1.42×10⁴ | 0.113 | 4.74×10⁴ | 0.8837 |
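The zero distances in the table can be illustrated with a small sketch of the complement and reverse complement operations. It uses a pairing-based map in which A and T share one value and C and G share another, which is an assumed form of the Paired Numeric representation; the helper names and the example sequence are hypothetical. The final check compares DFT magnitudes, which is one possible explanation for a zero reverse complementary distance and is also an assumption, not a statement of the authors' pipeline.

```python
import numpy as np

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(seq):
    """Replace each base with its Watson-Crick partner."""
    return seq.upper().translate(COMPLEMENT)

def reverse_complement(seq):
    """Complement the sequence, then read it in the opposite direction."""
    return complement(seq)[::-1]

# Assumed pairing-based map: A/T share one value, C/G share another.
PAIRED = {"A": 1.0, "T": 1.0, "C": -1.0, "G": -1.0}

def encode(seq):
    return np.array([PAIRED[base] for base in seq.upper()])

seq = "AACGTGCTA"  # hypothetical sequence
x = encode(seq)

# The complement swaps bases within each pair, so the encoded signal is unchanged.
print(np.array_equal(x, encode(complement(seq))))

# The reverse complement reverses the signal; for a real signal this leaves
# the DFT magnitude unchanged.
xr = encode(reverse_complement(seq))
print(np.allclose(np.abs(np.fft.fft(x)), np.abs(np.fft.fft(xr))))
```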
Fig 8. Biological experiment results for the similarity computation of the selected gene RP-S18 sequences with respect to H. sapiens (left column) and S. cerevisiae (right column) when using the four selected similarity metrics.
Fig 9. Biological experiment results for the similarity computation of the selected gene COX1 sequences with respect to H. sapiens (top), Drosophila melanogaster (middle), and Oryza sativa (bottom) when using the Euclidean distance as the similarity metric.
Fig 12. Biological experiment results for the similarity computation of the selected gene COX1 sequences with respect to H. sapiens (top), Drosophila melanogaster (middle), and Oryza sativa (bottom) when using the correlation coefficient as the similarity metric.