
3D-dynamic representation of DNA sequences.

Piotr Wąż, Dorota Bielińska-Wąż

Abstract

A new 3D graphical representation of DNA sequences is introduced. This representation is called the 3D-dynamic representation. It is a generalization of the 2D-dynamic representation. The sequences are represented by sets of "material points" in 3D space. The resulting 3D-dynamic graphs are treated as rigid bodies. The descriptors characterizing the graphs are analogous to those used in classical dynamics. The classification diagrams derived from this representation are presented and discussed. Owing to the third dimension, "the history of the graph" can be recognized graphically because the 3D-dynamic graph does not overlap with itself. Specific parts of the graphs correspond to specific parts of the sequence. This feature is essential for graphical comparisons of the sequences. Numerically, both the 2D and 3D approaches are of high quality. In particular, a difference in a single base between two sequences can be identified and correctly described (one can identify which base) by both the 2D and 3D methods.

Year:  2014        PMID: 24567158      PMCID: PMC3964303          DOI: 10.1007/s00894-014-2141-8

Source DB:  PubMed          Journal:  J Mol Model        ISSN: 0948-5023            Impact factor:   1.810


Introduction

Methods derived from physics, mathematics, and numerical analysis are frequently applied in modern biomedical sciences, which makes this branch of science interdisciplinary. In particular, the analysis of biological sequences (DNA, RNA, protein) combines methods from several disciplines. Graphical representations are powerful tools because they allow both graphical and numerical characterization of the sequences. The sequences are usually very long, and it is not obvious how to represent them. The questions of how to avoid degeneracy and how to express the features of the sequences both graphically and numerically have led to numerous methods. In the present work, we introduce a new 3D graphical representation method. The proposed method is a 3D generalization of the 2D-dynamic representation of DNA sequences [1]. The 2D-dynamic graphs represent the DNA sequences. They are composed of “material points” distributed in a 2D space, and their distribution is determined by the sequence. We proposed the moments of inertia and the coordinates of the centers of mass of the 2D-dynamic graphs for the numerical characterization of the DNA sequences [1]. We also considered the high-order moments of the mass-density distributions based on 2D-dynamic graphs as descriptors [2]. The mass overlaps and the angles between the X axis and the principal axis of inertia have also been used to describe the similarity/dissimilarity of the DNA sequences [3]. Both our methods (2D- and 3D-dynamic representations) are based on a walk in a space, which is one of the common approaches in this field. The 2D graphical representation methods originated in visualizations of such walks [4-6]. Approaches based on a walk in a 3D space may be found in [7-11]. They differ in the basis vectors assigned to particular bases and in the numerical characterizations of the graphs.
Examples of various 3D graphical representation methods may be found in [12-23]. In the present work we model a DNA sequence as a set of “material points” in 3D space. As a consequence, the sequence is characterized by dynamical quantities, e.g., moments of inertia, analogously to the 2D-dynamic representation. Therefore we retained the name ‘3D-dynamic representation of DNA sequences’. Using the new model we construct classification diagrams.

Method

The proposed method is based on the convention of a walk in a 3D space. A base in a sequence is represented by a material point in the 3D space. To each point an abstract mass is assigned. We start the walk at the point with coordinates (0,0,0). In each step this point is shifted by a unit vector. We represent the bases by the following unit vectors: A = (−1,0,1), G = (1,0,1), C = (0,1,1), and T = (0,−1,1). At the end of the vector we locate a mass m = 1. As a consequence, the 3D-dynamic graph is obtained. It consists of material points with unit masses in the 3D space. The distribution of the points in the space is determined by the sequence.

The coordinates of the center of mass of the 3D-dynamic graph in the {X,Y,Z} coordinate system are defined as

$$\mu_x = \frac{\sum_i m_i x_i}{\sum_i m_i}, \qquad \mu_y = \frac{\sum_i m_i y_i}{\sum_i m_i}, \qquad \mu_z = \frac{\sum_i m_i z_i}{\sum_i m_i},$$

where $x_i$, $y_i$, $z_i$ are the coordinates of the mass $m_i$. Since $m_i = 1$ for all the points, the total mass of the sequence is $N = \sum_i m_i$, where $N$ is the length of the sequence. Then, the coordinates of the center of mass of the 3D-dynamic graph may be expressed as

$$\mu_x = \frac{1}{N}\sum_i x_i, \qquad \mu_y = \frac{1}{N}\sum_i y_i, \qquad \mu_z = \frac{1}{N}\sum_i z_i.$$

The tensor of the moment of inertia is given by the matrix

$$\hat{I} = \begin{pmatrix} I_{xx} & I_{xy} & I_{xz} \\ I_{yx} & I_{yy} & I_{yz} \\ I_{zx} & I_{zy} & I_{zz} \end{pmatrix}$$

with

$$I_{xx} = \sum_i m_i \left(y_i'^2 + z_i'^2\right), \qquad I_{yy} = \sum_i m_i \left(x_i'^2 + z_i'^2\right), \qquad I_{zz} = \sum_i m_i \left(x_i'^2 + y_i'^2\right),$$

$$I_{xy} = I_{yx} = -\sum_i m_i x_i' y_i', \qquad I_{xz} = I_{zx} = -\sum_i m_i x_i' z_i', \qquad I_{yz} = I_{zy} = -\sum_i m_i y_i' z_i',$$

where $x_i'$, $y_i'$, $z_i'$ are the coordinates of $m_i$ in the Cartesian coordinate system whose origin has been selected at the center of mass. The eigenvalue problem of the tensor of inertia is defined as

$$\hat{I}\,\omega_k = I_k\,\omega_k, \qquad k = 1, 2, 3,$$

where $I_k$ are the eigenvalues and $\omega_k$ the eigenvectors. The eigenvalues are obtained by solving the third-order secular equation

$$\begin{vmatrix} I_{xx} - I & I_{xy} & I_{xz} \\ I_{yx} & I_{yy} - I & I_{yz} \\ I_{zx} & I_{zy} & I_{zz} - I \end{vmatrix} = 0.$$

The eigenvectors $\omega_1$, $\omega_2$, $\omega_3$ are orthonormal. Thus, they form a basis for a new coordinate system. The corresponding axes of this new system are denoted $\Omega_1$, $\Omega_2$, $\Omega_3$ and referred to as the principal axes. The eigenvalues $I_1$, $I_2$, $I_3$ are called the principal moments of inertia and are equal to the moments of inertia associated with rotations around the principal axes.

The relative orientation of the new and old coordinate systems may be described by the cosines of properly defined angles. Let $M_1$, $M_2$, and $M_3$ denote, respectively, the planes (X,Y), (X,Z), and (Y,Z). Similarly, $N_1$, $N_2$, $N_3$ stand for the planes ($\Omega_1$,$\Omega_2$), ($\Omega_1$,$\Omega_3$), ($\Omega_2$,$\Omega_3$), respectively. For the characterization of the 3D-dynamic graphs we use the cosines of the angles between the planes of the two systems of coordinates:

$$C_{ij} = \cos\left(M_i, N_j\right), \qquad i, j = 1, 2, 3.$$

It is also convenient to use square roots of the normalized principal moments of inertia:

$$r_i = \sqrt{\frac{I_i}{N}}, \qquad i = 1, 2, 3.$$

As the descriptors of the 3D-dynamic graphs we take: (i) the coordinates of the centers of mass of the graphs, (ii) the principal moments of inertia of the graphs, and (iii) the values of $C_{ij}$.
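The descriptors defined in this section can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: it relies on NumPy, the function names `graph_points` and `descriptors` are hypothetical, the eigenvalues are sorted in descending order (the ordering convention is an assumption), and the normalization of the principal moments by the sequence length N is an assumption. The angle between two planes is taken as the angle between their normals, and the signs of the cosines depend on the arbitrary sign of each eigenvector.

```python
import numpy as np

# Step vectors quoted in the Method section for A, G, C, T.
STEPS = {"A": (-1, 0, 1), "G": (1, 0, 1), "C": (0, 1, 1), "T": (0, -1, 1)}


def graph_points(seq):
    """Return the N x 3 array of material points (unit masses) for a DNA string."""
    steps = np.array([STEPS[base] for base in seq.upper()], dtype=float)
    # The walk starts at the origin; point k is the sum of the first k steps.
    return np.cumsum(steps, axis=0)


def descriptors(seq):
    """Center of mass, principal moments, plane cosines and r values."""
    pts = graph_points(seq)
    n = len(pts)
    mu = pts.mean(axis=0)            # center of mass (all masses equal 1)
    q = pts - mu                     # coordinates relative to the center of mass
    # Inertia tensor: sum_i (|q_i|^2 * identity - q_i q_i^T)
    inertia = np.sum(q * q) * np.eye(3) - q.T @ q
    vals, vecs = np.linalg.eigh(inertia)   # ascending eigenvalues, eigenvectors as columns
    order = np.argsort(vals)[::-1]         # assumed convention: I1 >= I2 >= I3
    moments, omega = vals[order], vecs[:, order]
    # Normals of M1=(X,Y), M2=(X,Z), M3=(Y,Z) are the Z, Y, X axes (rows).
    normals_m = np.eye(3)[[2, 1, 0]]
    # Normals of N1, N2, N3 are the eigenvectors ω3, ω2, ω1 (columns).
    normals_n = omega[:, [2, 1, 0]]
    cosines = normals_m @ normals_n        # cosines[i-1, j-1] plays the role of C_ij
    # Assumed normalization by N; clip guards against tiny negative round-off.
    r = np.sqrt(np.clip(moments, 0.0, None) / n)
    return mu, moments, cosines, r


mu, moments, cosines, r = descriptors("ACGT")
print(mu)
print(moments)
print(cosines)
print(r)
```

Each row of `cosines` has unit length because the eigenvectors are orthonormal, which matches the rows of the cosine tables reported in the Results.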

Results and discussion

The new approach has been applied to histone H4 coding sequences of the species listed in Table 1 and to alpha globin coding sequences of the species listed in Table 4. All histone H4 coding sequences have length N = 312, and all alpha globin coding sequences have length N = 429.
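Table 1

Coordinates of the centers of mass of the graphs representing histone H4 coding sequences

| No. | Species | Gene ID (EMBL) | $\mu_x$ | $\mu_y$ | $\mu_z$ |
|---|---|---|---|---|---|
| 1 | chicken | M74533 | 26.95 | 34.15 | 156.5 |
| 2 | chicken | M74534 | 26.95 | 34.29 | 156.5 |
| 3 | human | M60749 | 12.34 | 9.228 | 156.5 |
| 4 | mouse | V00753 | 17.86 | 19.25 | 156.5 |
| 5 | rat | M27433 | 16.92 | 17.93 | 156.5 |
| 6 | wheat | M12277 | 24.93 | 34.84 | 156.5 |
| 7 | maize | M36659 | 28.59 | 25.55 | 156.5 |
| 8 | maize | M13370 | 29.22 | 27.84 | 156.5 |
| 9 | maize | M13377 | 29.48 | 25.68 | 156.5 |
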
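Table 4

Coordinates of the centers of mass of the graphs representing alpha globin coding sequences

| No. | Species | Gene ID (EMBL) | $\mu_x$ | $\mu_y$ | $\mu_z$ |
|---|---|---|---|---|---|
| 1 | goat | EU938069 | 26.01 | 33.03 | 215.0 |
| 2 | chicken | M15379 | 2.312 | 33.62 | 215.0 |
| 3 | rhesus monkey | J04495 | 31.01 | 36.42 | 215.0 |
| 4 | orangutan | M12157 | 23.43 | 40.97 | 215.0 |
| 5 | horse | M17902 | 23.02 | 38.12 | 215.0 |
| 6 | mouse | EF605407 | 15.79 | 16.19 | 215.0 |
| 7 | rabbit | M11113 | 12.94 | 36.67 | 215.0 |
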
Some examples of 3D-dynamic graphs are shown in Fig. 1.
Fig. 1

Examples of 3D-dynamic graphs: No. 3 (M60749, former gene ID HSHISAD) and No. 6 (M12277, former gene ID TAH4091); see Table 1

Figure 2 shows the 2D-dynamic graph for the same sequence (No. 3 in Table 1) as in Fig. 1. The 2D-dynamic graphs remove the degeneracy present in the Nandy plots [5]. This degeneracy comes from the so-called repetitive walks (walks performed back and forth along the same trace). Because the 2D-dynamic graphs contain points with different masses, the repetitive walks can be recognized both graphically and numerically (the descriptors depend on the masses different from 1). However, the 2D-dynamic graphs still do not retain the history of the sequence. Introducing the third dimension avoids self-overlapping of the graph.
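The difference between the two representations can be illustrated on a short toy sequence. The sketch below is hedged: it assumes that the 2D step vectors are the (x, y) projections of the 3D ones and that the mass of a 2D point counts how many times the walk visits it. The sequence `AGAGCT` and the helper names `walk` and `masses` are hypothetical and are not taken from the paper.

```python
from collections import Counter

# 3D step vectors from the Method section; the 2D walk is assumed to use
# only their first two components.
STEPS_3D = {"A": (-1, 0, 1), "G": (1, 0, 1), "C": (0, 1, 1), "T": (0, -1, 1)}


def walk(seq, dims):
    """Positions visited by the walk, keeping the first `dims` coordinates."""
    pos = (0, 0, 0)
    visited = []
    for base in seq.upper():
        pos = tuple(p + s for p, s in zip(pos, STEPS_3D[base]))
        visited.append(pos[:dims])
    return visited


def masses(seq, dims):
    """Mass accumulated at each point (number of visits)."""
    return Counter(walk(seq, dims))


seq = "AGAGCT"  # toy example containing a back-and-forth (repetitive) walk
m2 = masses(seq, 2)
m3 = masses(seq, 3)
print("2D points and masses:", dict(m2))
print("3D points and masses:", dict(m3))
```

In this toy case the 2D walk returns to the same points, so some masses exceed 1 and the order of visits is lost, whereas every 3D point has mass 1 because the z coordinate grows by one at each step.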
Fig. 2

2D-dynamic graph: No. 3 (M60749)

Numerically, each graph is characterized by descriptors. The values of the descriptors considered in this work are shown in Tables 1, 2, 3, 4, 5, and 6. Due to the choice of the unit vectors representing the four bases, $\mu_x$ and $\mu_y$ give information about the relative numbers of particular bases in the sequences, and $\mu_z$ contains information about the lengths of the sequences only. The values of $\mu_x$ and $\mu_y$ shown in Tables 1 and 4 are identical to $\mu_x$ and $\mu_y$ for the 2D-dynamic graphs of the same sequences [1]. New information is contained in the other descriptors (Tables 2, 3, 5, and 6). The descriptors are very sensitive: they correctly identify a single-base difference between two sequences. Sequence No. 6 in Table 4 (EF605407) differs by two bases from the sequence (MMAGL1) used in the calculations in [1]. The base T in MMAGL1 is replaced by C in EF605407 at position 132 of the sequence, and the base A in MMAGL1 is replaced by G in EF605407 at position 366. As a consequence of the change T to C, $\mu_y$ increased, and as a consequence of the change A to G, $\mu_x$ increased: $\mu_x = 15.49$, $\mu_y = 14.80$ for MMAGL1, and $\mu_x = 15.79$, $\mu_y = 16.19$ for EF605407.
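The reported numbers can be checked with a short calculation. Because the z component of every step equals 1, the z coordinates of the points are 1, 2, ..., N and their mean is (N + 1)/2. A substitution that changes one step component by 2 (T to C in y, or A to G in x) shifts every point from that position onward, so the corresponding center-of-mass coordinate changes by 2(N − p + 1)/N for a 1-based position p. The sketch below applies these two formulas; the function names `mu_z` and `mu_shift` are hypothetical.

```python
def mu_z(n):
    """z coordinate of the center of mass: mean of 1, 2, ..., n."""
    return (n + 1) / 2


def mu_shift(n, position, delta=2):
    """Change of a center-of-mass coordinate when the step at the 1-based
    `position` changes by `delta` in that coordinate."""
    return delta * (n - position + 1) / n


n = 429  # length of the alpha globin coding sequences
print(mu_z(312), mu_z(429))                # 156.5 215.0
print(round(14.80 + mu_shift(n, 132), 2))  # 16.19 (T -> C at position 132)
print(round(15.49 + mu_shift(n, 366), 2))  # 15.79 (A -> G at position 366)
```

The results agree with the values of $\mu_z$ in Tables 1 and 4 and with the values quoted for MMAGL1 and EF605407.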
Table 2

Principal moments of inertia of the graphs and cosines of the angles relative to $M_1$ representing histone H4 coding sequences

| No. | $I_1$ | $I_2$ | $I_3$ | $C_{11}$ | $C_{12}$ | $C_{13}$ |
|---|---|---|---|---|---|---|
| 1 | 2718517.5 | 2717050.5 | 5248.7777 | 0.9654 | −0.2012 | 0.1657 |
| 2 | 2721386.8 | 2719932.1 | 5123.5590 | 0.9649 | −0.1860 | 0.1854 |
| 3 | 2567018.1 | 2569325.3 | 5702.8901 | 0.9933 | −0.0931 | 0.0690 |
| 4 | 2629277.7 | 2630996.7 | 4799.7393 | 0.9814 | −0.1898 | 0.0291 |
| 5 | 2641789.4 | 2644023.6 | 5243.5711 | 0.9791 | −0.1611 | 0.1245 |
| 6 | 2718890.3 | 2723552.5 | 6553.1238 | 0.9650 | −0.2624 | −0.0018 |
| 7 | 2657698.2 | 2660580.3 | 4894.8295 | 0.9760 | −0.2120 | −0.0495 |
| 8 | 2677850.1 | 2681309.5 | 6696.9065 | 0.9725 | −0.2323 | 0.0186 |
| 9 | 2652990.0 | 2655433.3 | 5383.5528 | 0.9770 | −0.1951 | −0.0864 |

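Table 3

Cosines of the angles relative to $M_2$ and $M_3$ representing histone H4 coding sequences

| No. | $C_{21}$ | $C_{22}$ | $C_{23}$ | $C_{31}$ | $C_{32}$ | $C_{33}$ |
|---|---|---|---|---|---|---|
| 1 | 0.2222 | 0.3029 | −0.9268 | 0.1363 | 0.9315 | 0.3371 |
| 2 | 0.2245 | 0.2180 | −0.9498 | 0.1362 | 0.9581 | 0.2521 |
| 3 | 0.0849 | 0.1791 | −0.9802 | 0.0789 | 0.9794 | 0.1858 |
| 4 | 0.1609 | 0.7302 | −0.6640 | 0.1048 | 0.6563 | 0.7472 |
| 5 | 0.1767 | 0.3678 | −0.9130 | 0.1013 | 0.9158 | 0.3886 |
| 6 | 0.2422 | 0.8932 | −0.3788 | 0.1010 | 0.3651 | 0.9255 |
| 7 | 0.1781 | 0.9085 | −0.3780 | 0.1251 | 0.3601 | 0.9245 |
| 8 | 0.1920 | 0.7530 | −0.6293 | 0.1322 | 0.6156 | 0.7769 |
| 9 | 0.1713 | 0.9587 | −0.2271 | 0.1271 | 0.2071 | 0.9700 |
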
Table 5

Principal moments of inertia of the graphs and cosines of the angles relative to $M_1$ representing alpha globin coding sequences

| No. | $I_1$ | $I_2$ | $I_3$ | $C_{11}$ | $C_{12}$ | $C_{13}$ |
|---|---|---|---|---|---|---|
| 1 | 6868772.7 | 6870362.4 | 7983.8311 | 0.9789 | −0.1979 | 0.0503 |
| 2 | 6788337.0 | 6796077.5 | 11107.903 | 0.9846 | −0.1694 | 0.0428 |
| 3 | 6975843.0 | 6978180.6 | 7307.5904 | 0.9713 | −0.2362 | 0.0266 |
| 4 | 6948275.6 | 6949747.9 | 5325.5281 | 0.9732 | −0.2283 | −0.0271 |
| 5 | 6893025.9 | 6894510.4 | 7514.3009 | 0.9772 | −0.1935 | −0.0875 |
| 6 | 6730034.2 | 6726610.9 | 9920.4895 | 0.9892 | −0.1446 | −0.0226 |
| 7 | 6886040.9 | 6887794.4 | 7488.2889 | 0.9777 | −0.2058 | 0.0425 |

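Table 6

Cosines of the angles relative to $M_2$ and $M_3$ representing alpha globin coding sequences

| No. | $C_{21}$ | $C_{22}$ | $C_{23}$ | $C_{31}$ | $C_{32}$ | $C_{33}$ |
|---|---|---|---|---|---|---|
| 1 | 0.1779 | 0.7052 | −0.6863 | 0.1003 | 0.6808 | 0.7256 |
| 2 | 0.1728 | 0.9075 | −0.3828 | 0.0260 | 0.3843 | 0.9228 |
| 3 | 0.1989 | 0.7466 | −0.6348 | 0.1301 | 0.6219 | 0.7722 |
| 4 | 0.2076 | 0.9233 | −0.3231 | 0.0988 | 0.3088 | 0.9460 |
| 5 | 0.1931 | 0.9811 | −0.0133 | 0.0885 | −0.0039 | 0.9961 |
| 6 | 0.1207 | 0.8933 | −0.4329 | 0.0828 | 0.4255 | 0.9012 |
| 7 | 0.2001 | 0.8502 | −0.4870 | 0.0641 | 0.4846 | 0.8724 |
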
The descriptors have been used for the construction of the classification diagrams shown in Figs. 3, 4, 5, 6, 7, and 8. Figure 3 shows the classification diagram ––. The descriptors representing histone H4 coding sequences are represented in the figure by crosses and those representing alpha globin coding sequences by triangles. The crosses and the triangles are located in different parts of the diagram. In the figure these parts are separated by a plane.
Fig. 3

Classification diagram ––

Fig. 4

Classification diagram $C_{11}$–$C_{12}$–$C_{13}$

Fig. 5

Classification diagram $C_{21}$–$C_{22}$–$C_{23}$

Fig. 6

Classification diagram $C_{31}$–$C_{32}$–$C_{33}$

Fig. 7

Classification diagram ––

Fig. 8

Classification diagram ––

Using the present approach one can also create very detailed classification diagrams (in this case, for histone H4 coding sequences of evolutionarily similar organisms). The similarity matrix obtained with the standard Clustal W approach for histone H4 coding sequences was given in [3] (the similarity values are greater than or equal to 78%). The considered sequences are rather similar to each other, and it is difficult to find a property that distinguishes between different species. In particular, a good test of the new methods is to find descriptors for which the points representing sequences of evolutionarily similar organisms form clusters: plants and vertebrates for histone H4 coding sequences. Most of the descriptors give larger similarity values between the sequences of chicken (No. 1 and 2 in Table 1) and the sequences of plants than between the sequences of chicken and those of the other vertebrates. Using the 2D-dynamic representation we found some properties that give the classification of the sequences into plants and vertebrates [24]. In the present work, we find more descriptors that give a similar classification. The histone H4 coding sequences of plants are represented by full squares, and those of vertebrates by empty circles in Figs. 4, 5, 6, 7, and 8. A clusterization of the sequences representing evolutionarily similar organisms is obtained for the parameters $C_{ij}$, i, j = 1, 2, 3 (Figs. 4, 5, and 6) and for the descriptors composed of moments of inertia, coordinates of centers of mass of the graphs, and the coefficients $r_i$, i = 1, 2, 3 (Figs. 7 and 8). Figure 4 corresponds to i = 1, j = 1, 2, 3, Fig. 5 to i = 2, j = 1, 2, 3, and Fig. 6 to i = 3, j = 1, 2, 3.
The descriptors representing the sequences of plants and of vertebrates are located in different parts of the diagrams. In order to visualize the classifications, the clusters of descriptors corresponding to different species have been separated by planes.

Summarizing, both approaches (2D- and 3D-dynamic representations) are examples of graphical representation methods. The very popular methods based on the alignment of the sequences give rather limited information about the similarity/dissimilarity of the sequences. Their degeneracy is relatively high: the same similarity value is obtained whether T, C, G, or A bases align. Using graphical representation methods one can consider different aspects of similarity separately, both graphically and numerically. The computing time of these methods is low. The 3D-dynamic graphs are generalizations of the 2D-dynamic graphs. The descriptors used for the characterization of the graphs are also related to dynamics. The proposed descriptors of the 3D-dynamic graphs lead to new classification diagrams for the considered data, analogously to the 2D-dynamic graphs [24]. Therefore the descriptors proposed for both 2D- and 3D-dynamic graphs are good, reliable, and sensitive tools for the similarity/dissimilarity analysis of DNA sequences. The 3D-dynamic graphs retain the history of the sequences, and this is one of their advantages. The consecutive bases in the sequences are represented by the appropriate parts of the 3D-dynamic graphs (the 3D graph never overlaps with itself). Therefore future applications of the 3D method, both as a graphical and as a numerical tool, seem promising.
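The separation of clusters by a plane can be illustrated with the ($C_{11}$, $C_{12}$, $C_{13}$) values listed in Table 2 for the histone H4 coding sequences. The sketch below is only an illustration: the plane $C_{13} = 0.024$ (the constants `NORMAL` and `OFFSET`) is a hypothetical choice made here because it happens to split the tabulated values, and it is not claimed to be the plane drawn in the paper's figure. The function name `side` is likewise hypothetical.

```python
# (C11, C12, C13) values from Table 2, keyed by sequence number.
vertebrates = {
    1: (0.9654, -0.2012, 0.1657),   # chicken
    2: (0.9649, -0.1860, 0.1854),   # chicken
    3: (0.9933, -0.0931, 0.0690),   # human
    4: (0.9814, -0.1898, 0.0291),   # mouse
    5: (0.9791, -0.1611, 0.1245),   # rat
}
plants = {
    6: (0.9650, -0.2624, -0.0018),  # wheat
    7: (0.9760, -0.2120, -0.0495),  # maize
    8: (0.9725, -0.2323, 0.0186),   # maize
    9: (0.9770, -0.1951, -0.0864),  # maize
}

# Hypothetical separating plane n . p = d, chosen for illustration only.
NORMAL = (0.0, 0.0, 1.0)
OFFSET = 0.024


def side(point):
    """Signed value n . p - d; its sign tells on which side of the plane p lies."""
    return sum(n * p for n, p in zip(NORMAL, point)) - OFFSET


for label, group in (("vertebrates", vertebrates), ("plants", plants)):
    for number, point in group.items():
        print(label, number, "positive side" if side(point) > 0 else "negative side")
```

With these tabulated values all vertebrate sequences fall on the positive side of the illustrative plane and all plant sequences on the negative side, although the margin between sequences No. 4 and No. 8 is small.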
  9 in total

1.  On 3-D graphical representation of DNA primary sequences and their numerical characterization.

Authors:  M Randić; M Vracko; A Nandy; S C Basak
Journal:  J Chem Inf Comput Sci       Date:  2000 Sep-Oct

2.  The Z curve database: a graphic representation of genome sequences.

Authors:  Chun-Ting Zhang; Ren Zhang; Hong-Yu Ou
Journal:  Bioinformatics       Date:  2003-03-22       Impact factor: 6.937

Review 3.  On a 3-D representation of DNA primary sequences.

Authors:  Chun Li; Jun Wang
Journal:  Comb Chem High Throughput Screen       Date:  2004-02       Impact factor: 1.339

4.  C-curve: a novel 3D graphical representation of DNA sequence based on codons.

Authors:  Nafiseh Jafarzadeh; Ali Iranmanesh
Journal:  Math Biosci       Date:  2012-12-13       Impact factor: 2.144

5.  Non-degenerate graphical representation of DNA sequences and its applications to phylogenetic analysis.

Authors:  Yan Yang; Yingying Zhang; Meiduo Jia; Chun Li; Liangyu Meng
Journal:  Comb Chem High Throughput Screen       Date:  2013-09       Impact factor: 1.339

6.  Simpler DNA sequence representations.

Authors:  M A Gates
Journal:  Nature       Date:  1985 Jul 18-24       Impact factor: 49.962

7.  Random walk and gap plots of DNA sequences.

Authors:  P M Leong; S Morgenthaler
Journal:  Comput Appl Biosci       Date:  1995-10

8.  H curves, a novel method of representation of nucleotide series especially suited for long DNA sequences.

Authors:  E Hamori; J Ruskin
Journal:  J Biol Chem       Date:  1983-01-25       Impact factor: 5.157

9.  Descriptors of 2D-dynamic graphs as a classification tool of DNA sequences.

Authors:  Piotr Wąż; Dorota Bielińska-Wąż; Ashesh Nandy
Journal:  J Math Chem       Date:  2013-09-03       Impact factor: 2.357
