
Three 3D graphical representations of DNA primary sequences based on the classifications of DNA bases and their applications.

Guosen Xie, Zhongxi Mo.

Abstract

In this article, we introduce three 3D graphical representations of DNA primary sequences, which we call RY-curve, MK-curve and SW-curve, based on three classifications of the DNA bases. The advantages of our representations are that (i) these 3D curves are strictly non-degenerate and there is no loss of information when transferring a DNA sequence to its mathematical representation and (ii) the coordinates of every node on these 3D curves have clear biological implication. Two applications of these 3D curves are presented: (a) a simple formula is derived to calculate the content of the four bases (A, G, C and T) from the coordinates of nodes on the curves; and (b) a 12-component characteristic vector is constructed to compare similarity among DNA sequences from different species based on the geometrical centers of the 3D curves. As examples, we examine similarity among the coding sequences of the first exon of beta-globin gene from eleven species and validate similarity of cDNA sequences of beta-globin gene from eight species.
Copyright © 2010 Elsevier Ltd. All rights reserved.

Year:  2010        PMID: 20969878      PMCID: PMC7126940          DOI: 10.1016/j.jtbi.2010.10.018

Source DB:  PubMed          Journal:  J Theor Biol        ISSN: 0022-5193            Impact factor:   2.691


Introduction

Advances in DNA sequencing technology and DNA databases have greatly facilitated biological research involving DNA sequences. However, the information contained in DNA sequences is difficult for humans to comprehend without careful extraction and processing. Many methods have been proposed to characterize DNA sequences, with special effort devoted to representing the sequence graphically. Graphical approaches can provide an intuitive picture and useful insights that help analyze complicated relations in biological systems, as demonstrated by many previous studies on a series of important biological topics, such as analysis of codon usage (Chou and Zhang, 1992; Zhang and Chou, 1993; Zhang and Chou, 1994), base frequencies in the anti-sense strands (Chou et al., 1996), analysis of DNA and protein sequences (Qi et al., 2007; Wu et al., 2010; Yu et al., 2009), enzyme-catalyzed reactions (Andraos, 2008; Chou, 1980; Chou, 1981; Chou, 1989; Chou et al., 1979; Chou and Forsen, 1980; Chou and Liu, 1981; Lin and Neet, 1990; Zhou and Deng, 1984), protein folding kinetics and folding rates (Chou, 1990; Chou and Shen, 2009; Shen et al., 2009), inhibition kinetics of processive nucleic acid polymerases and nucleases (Chou et al., 1994), and drug metabolism systems (Chou, 2010). Moreover, graphical methods have also been introduced to deal with other biological and medical problems (Diao et al., 2007; Gonzalez-Diaz et al., 2009; Munteanu et al., 2009).
Recently, cellular automata images (Wolfram, 1984; Wolfram, 2002) have also been used to represent biological sequences (Xiao et al., 2005a) for predicting protein structural classes (Xiao et al., 2008) and subcellular location (Xiao et al., 2006b), identifying G-protein-coupled receptor functional classes (Xiao et al., 2009), investigating HBV gene missense mutations (Xiao et al., 2005b) and HBV viral infections (Xiao et al., 2006a), and analyzing SARS-CoV (Gao et al., 2006; Wang et al., 2005).

Graphical representation of DNA sequences was first proposed by Hamori and Ruskin (1983). Gates (1986), Nandy (1994) and Leong and Morgenthaler (1995) developed 2D graphical representations of DNA sequences. These methods are straightforward but lose some information, because the curve representing the DNA sequence can overlap and cross itself and circuits in the curve produce degeneracy. Randic et al. (2003) developed a novel 2D representation method in which there is no loss of information in transferring a DNA sequence to its mathematical representation. More recently, several other 2D representations have been proposed (Wang and Zhang, 2006; Zhang, 2009; Yao et al., 2008; Zhao et al., 2010).

As for 3D graphical representations, Hamori and Ruskin (1983) developed the H-curve, which can uniquely represent a DNA sequence. Based on the classifications of DNA bases, Zhang et al. (Zhang and Zhang, 1994; Zhang, 1997; Zhang et al., 2003) created the Z-curve to represent DNA sequences, in which the four bases (A, G, T and C) are represented by the four vertices of a regular tetrahedron: A(1,1,1), T(–1,–1,1), C(–1,1,–1) and G(1,–1,–1). The Z-curve is a 3D graphical representation with clear biological implication. However, as pointed out by Tang et al. (2010), the Z-curve representation has a defect: the resulting spatial curve may contain a loop if the frequencies of the four bases present in the sequence are the same. Randic et al. (2000) presented another 3D graphical representation method, but crossing and overlapping of the spatial curve representing a DNA sequence remain a limitation. Recently, further 3D representations were developed by several authors (Li and Wang, 2004; Liao and Wang, 2004; Yu et al., 2009) to overcome the problem of degeneracy in graphical representation. These methods, however, do not seem to possess apparent biological meaning.

In this article, we introduce three novel 3D graphical representations of DNA primary sequences, namely the RY-curve, the MK-curve and the SW-curve. These curves are derived from three classifications of the four DNA bases A, G, T and C, respectively. It can be proved that the proposed representations are strictly non-degenerate and therefore avoid information loss when transferring a DNA sequence to its representation. Moreover, the coordinates of every node on these 3D curves have clear biological implication. In Section 4, we present two applications based on the proposed representations.

Construction of RY-curve, MK-curve and SW-curve

The four DNA bases (A, G, T and C) can be classified in the following three ways according to their chemical properties:

(i) Chemical structure of the bases: R (purine) = A, G / Y (pyrimidine) = T, C;
(ii) Functional groups of the bases: M (amino) = A, C / K (keto) = G, T;
(iii) Strength of the hydrogen bonds between paired bases: S (strong) = G, C / W (weak) = A, T.

First consider the R/Y classification. In 3D space, a point or a vector has three components. We assign the following vectors to the four DNA bases:

$$A \to (1,-1,0),\qquad G \to (1,1,0),\qquad T \to (1,0,1),\qquad C \to (1,0,-1). \tag{1}$$

Notice that we restrict the two vectors representing the purine bases R = A, G to the x–y plane and the two vectors representing the pyrimidine bases Y = T, C to the x–z plane (see Fig. 1).
Fig. 1

The vectors representing the four bases according to the R/Y classification. Purine bases R=A,G are limited in x–y plane and pyrimidine bases Y=T,C are limited in x–z plane.

Given a DNA sequence with n bases, $S = s_1 s_2 \cdots s_n$, we look at one base at a time. For the i-th base (i = 1, 2, …, n), a corresponding point $P_i(x_i, y_i, z_i)$ is determined in 3D space as follows:

$$x_i = \sum_{k=1}^{i} a_k^{x},\qquad y_i = \sum_{k=1}^{i} a_k^{y},\qquad z_i = \sum_{k=1}^{i} a_k^{z}, \tag{2}$$

where $a_k^{x}$, $a_k^{y}$ and $a_k^{z}$ represent the x-component, y-component and z-component of the vector corresponding to $s_k$, respectively. All n bases of the DNA sequence are examined consecutively, and in the end we obtain n points $P_1, P_2, \ldots, P_n$ in 3D space. Then, starting from the origin (0, 0, 0) and connecting adjacent points, we obtain a 3D curve, called the RY-curve.

As an example, suppose we have a sequence S = ATGGTCTTG. Applying the proposed method, we get ten points (including the origin) corresponding to the nine bases of the sequence:

$$(0,0,0),\ (1,-1,0),\ (2,-1,1),\ (3,0,1),\ (4,1,1),\ (5,1,2),\ (6,1,1),\ (7,1,2),\ (8,1,3),\ (9,2,3).$$

Connecting these points sequentially, we obtain the RY-curve (see Fig. 2) for this particular DNA sequence.
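The construction above can be illustrated with a minimal Python sketch. It assumes the assignment A → (1,−1,0), G → (1,1,0), T → (1,0,1), C → (1,0,−1), which is consistent with the vector set and the terminal-point relations given in Section 3; the function name `ry_curve` is hypothetical and not part of the original work.

```python
# Base vectors for the R/Y classification (assumed assignment, see lead-in):
# purines A, G lie in the x-y plane, pyrimidines T, C in the x-z plane.
RY_VECTORS = {
    "A": (1, -1, 0),
    "G": (1, 1, 0),
    "T": (1, 0, 1),
    "C": (1, 0, -1),
}


def ry_curve(sequence):
    """Return the nodes of the RY-curve, starting from the origin (0, 0, 0)."""
    points = [(0, 0, 0)]
    for base in sequence.upper():
        dx, dy, dz = RY_VECTORS[base]
        x, y, z = points[-1]
        # Each node is the running sum of the base vectors seen so far.
        points.append((x + dx, y + dy, z + dz))
    return points


if __name__ == "__main__":
    for point in ry_curve("ATGGTCTTG"):
        print(point)
```

For the example sequence ATGGTCTTG the sketch yields ten nodes, ending at (9, 2, 3): the x-coordinate equals the sequence length, y equals the number of G minus the number of A, and z equals the number of T minus the number of C.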
Fig. 2

The RY-curve representation of the sequence ATGGTCTTG.

Now we consider the M/K classification and the S/W classification of the four bases. For the M/K classification, we assign vectors to the four bases in the same manner (assignment (3)), restricting the two vectors representing the amino bases M = A, C to the x–y plane and the two vectors representing the keto bases K = G, T to the x–z plane. A different way of representing the DNA sequence graphically is thus established. We call the 3D curve generated under this definition the MK-curve. Similarly, for the S/W classification, we assign vectors to the four bases (assignment (4)) so that the weak hydrogen-bond bases W = A, T are restricted to the x–y plane and the strong hydrogen-bond bases S = G, C are restricted to the x–z plane. We then obtain the third 3D graphical representation of the DNA sequence. The 3D curve generated under this definition is called the SW-curve. As an example, Fig. 3 shows the RY-curve, MK-curve and SW-curve of exon 1 of the human beta-globin gene listed in Table 1.
Fig. 3

The RY-curve, MK-curve and SW-curve of the coding sequence of the first exon of the human beta-globin gene. (A) The RY-curve. (B) The MK-curve. (C) The SW-curve.

Table 1

The coding sequences of the first exon of beta-globin gene of 11 different species.

Species     Coding sequence
Human       ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATTAAGTTGGTGGTGAGGCCCTGGGCAG
Goat        ATGCTGACTGCTGAGGAGAAGGCTGCCGTCACCGGCTTCTGGGGCAAGGTGAAAGTGGATGAAGTTGGTGCTGAGGCCCTGGGCAG
Gallus      ATGGTGCACTGGACTGCTGAGGAGAAGCAGCTCATCACCGGCCTCTGGGGCAAGGTCAATGTGGCCGAATGTGGGGCCGAAGCCCTGGCCAG
Mouse       ATGGTTGCACCTGACTGATGCTGAGAAGTCTGCTGTCTCTTGCCTGTGGGCAAAGGTGAACCCCGATGAAGTTGGTGGTGAGGCCCTGGGCAGG
Rat         ATGGTGCACCTAACTGATGCTGAGAAGGCTACTGTTAGTGGCCTGTGGGGAAAGGTGAACCCTGATAATGTTGGCGCTGAGGCCCTGGGCAG
Chimpanzee  ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGTTGGTATCAAGG
Bovine      ATGCTGACTGCTGAGGAGAAGGCTGCCGTCACCGCCTTTTGGGGCAAGGTGAAAGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAG
Gorilla     ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGG
Opossum     ATGGTGCACTTGACTTCTGAGGAGAAGAACTGCATCACTACCATCTGGTCTAAGGTGCAGGTTGACCAGACTGGTGGTGAGGCCCTTGGCAG
Lemur       ATGACTTTGCTGAGTGCTGAGGAGAATGCTCATGTCACCTCTCTGTGGGGCAAGGTGGATGTAGAGAAAGTTGGTGGCGAGGCCTTGGGCAG
Rabbit      ATGGTGCATCTGTCCAGTGAGGAGAAGTCTGCGGTCACTGCCCTGTGGGGCAAGGTGAATGTGGAAGAAGTTGGTGGTGAGGCCCTGGGC

Properties of RY-curve, MK-curve and SW-curve

In this section, we prove some properties of the RY-curve, MK-curve and SW-curve. We use $A_n$, $G_n$, $T_n$ and $C_n$ to denote the content (number of occurrences) of the bases A, G, T and C, respectively, in a DNA sequence $S = s_1 s_2 \cdots s_n$, $s_i \in \{A, T, G, C\}$.

Property 3.1. There is no circuit and no degeneracy in the RY-curve, MK-curve and SW-curve.

Proof. We prove this property by contradiction. First consider the RY-curve. Suppose that there are one or more circuits in an RY-curve. Then there exists at least one point in 3D space at which the curve crosses itself. That is, two points on the curve, say $P_i$ and $P_j$, $i \neq j$, have exactly the same coordinates $(x_i, y_i, z_i) = (x_j, y_j, z_j)$. So we must have $x_i = x_j$. Since every vector in assignment (1) has x-component 1, Eq. (2) gives $x_i = i$ and $x_j = j$. This implies i = j, which contradicts the supposition that $i \neq j$. Therefore, there is no circuit and no degeneracy in the RY-curve. Similarly, we can show that there is no circuit and no degeneracy in the MK-curve and the SW-curve. □

Property 3.2. There exists a one-to-one correspondence between a DNA sequence and its RY-curve (MK-curve or SW-curve), and no information is lost.

Proof. First consider the RY-curve. By Eq. (2), for a given DNA sequence $S = s_1 s_2 \cdots s_n$ there exists one unique RY-curve. Conversely, suppose that the RY-curve of a DNA sequence is given; it then follows immediately that the coordinates of all n nodes on the RY-curve, $(x_i, y_i, z_i)$, i = 1, 2, …, n, are given. Let $(x_0, y_0, z_0) = (0, 0, 0)$. According to Eq. (2), the base $s_i$ corresponding to the node $P_i(x_i, y_i, z_i)$ on the RY-curve can be calculated using the following formula:

$$a_i = (x_i, y_i, z_i) - (x_{i-1}, y_{i-1}, z_{i-1}). \tag{5}$$

Formula (5) consists of the following set of equations:

$$x_i - x_{i-1} = a_i^{x},\qquad y_i - y_{i-1} = a_i^{y},\qquad z_i - z_{i-1} = a_i^{z}, \tag{6}$$

where $a_i = (a_i^{x}, a_i^{y}, a_i^{z}) \in \{(1,-1,0), (1,1,0), (1,0,1), (1,0,-1)\}$ and i = 1, 2, …, n. Note that $(x_0, y_0, z_0) = (0, 0, 0)$. Regarding $(x_1, y_1, z_1), \ldots, (x_n, y_n, z_n)$ as unknowns, the coefficient matrix of Eq. (6) is, for each of the three coordinates, the lower bidiagonal matrix

$$A = \begin{pmatrix} 1 & 0 & 0 & \cdots & 0 \\ -1 & 1 & 0 & \cdots & 0 \\ 0 & -1 & 1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & -1 & 1 \end{pmatrix}.$$

The determinant |A| = 1 ≠ 0; therefore Eqs. (6) have a unique solution, so the nodes of the curve and the vectors $a_i$ determine each other. This implies that one RY-curve uniquely determines one corresponding DNA sequence. Hence, the correspondence between DNA sequences and RY-curves is one-to-one and there is no loss of information. Similarly, we can prove that Property 3.2 holds for the MK-curve and the SW-curve as well. □

Property 3.3. The x-component of the terminal node $P_n(x_n, y_n, z_n)$ of the RY-curve (MK-curve or SW-curve) is just the length of the DNA sequence $S = s_1 s_2 \cdots s_n$; that is,

$$x_n = A_n + G_n + T_n + C_n = n.$$

The proof follows immediately from assignment (1) and Eq. (2). □

Property 3.4.
(i) For the RY-curve, its projections (2D curves) onto the x–y plane and the x–z plane denote the distributions of the purine bases (A, G) and the pyrimidine bases (T, C) along the sequence $S = s_1 s_2 \cdots s_n$, respectively, and we have $y_n = G_n - A_n$, $z_n = T_n - C_n$.
(ii) For the MK-curve, its projections (2D curves) onto the x–y plane and the x–z plane denote the distributions of the bases of the amino group (A, C) and the keto group (G, T) along the sequence, respectively, and we have $|y_n| = |A_n - C_n|$, $|z_n| = |G_n - T_n|$, with the signs fixed by assignment (3).
(iii) For the SW-curve, its projections (2D curves) onto the x–y plane and the x–z plane denote the distributions of the bases with weak H-bonds (A, T) and strong H-bonds (G, C) along the sequence, respectively, and we have $|y_n| = |A_n - T_n|$, $|z_n| = |G_n - C_n|$, with the signs fixed by assignment (4).

Proof. First we prove (i). From assignment (1), we know that the vectors representing the bases G and A are symmetrical about the x axis in the x–y plane, and the vectors representing the bases T and C are symmetrical about the x axis in the x–z plane. So, according to Eq. (2), we have

$$y_k = G_k - A_k,\qquad z_k = T_k - C_k,\qquad k = 1, 2, \ldots, n,$$

where $A_k$, $G_k$, $T_k$ and $C_k$ denote the numbers of A, G, T and C among the first k bases. The projection of the RY-curve onto the x–y plane is a 2D curve with nodes $\{(x_k, y_k),\ k = 1, 2, \ldots, n\}$ and the projection of the RY-curve onto the x–z plane is a 2D curve with nodes $\{(x_k, z_k),\ k = 1, 2, \ldots, n\}$. So, the projections of the RY-curve onto the x–y plane and the x–z plane display the distributions of the purine bases (A, G) and the pyrimidine bases (T, C) along the sequence, respectively. Similarly, we can prove (ii) and (iii). □

Fig. 4 shows the projections of the RY-curve of exon 1 of the human beta-globin gene onto the x–y plane and the x–z plane.
Fig. 4

The 2D projection of the RY-curve of the first exon of beta-globin gene of human in x–y plane and x–z plane. (A) The 2D projection of the RY-curve of the first exon of beta-globin gene of human in x–y plane, which denotes the distribution of purine bases (A,G) along the coding sequence. (B) The 2D projection of the RY-curve of the first exon of beta-globin gene of human in x–z plane, which denotes the distribution of pyrimidine bases (T,C) along the coding sequence.

From Fig. 4, we can see that $G_n > A_n$ and $T_n > C_n$, where $A_n$, $G_n$, $T_n$ and $C_n$ denote the numbers of A, G, T and C in the sequence; the changing trend of the content of bases A, G and bases T, C along the sequence is also directly observable.

Property 3.5. Let $(x_n, y_n, z_n)$ denote the coordinates of the terminal point of an RY-curve. Then the following relationships are true:
(i) if $y_n > 0$, $z_n > 0$, then $T_n + G_n > A_n + C_n$;
(ii) if $y_n < 0$, $z_n < 0$, then $T_n + G_n < A_n + C_n$;
(iii) if $y_n > 0$, $z_n < 0$, then $G_n + C_n > A_n + T_n$;
(iv) if $y_n < 0$, $z_n > 0$, then $G_n + C_n < A_n + T_n$;
(v) if $y_n = 0$, $z_n = 0$, then $T_n + G_n = A_n + C_n$ and $T_n = C_n$, $G_n = A_n$.

The above results follow directly from assignment (1) and Eq. (2). For the MK-curve and the SW-curve, there are analogous properties as well. □
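The one-to-one correspondence of Property 3.2 can be illustrated by decoding a curve back into its sequence: the difference between consecutive nodes is the base vector. The sketch below is only an illustration, assuming the RY assignment A → (1,−1,0), G → (1,1,0), T → (1,0,1), C → (1,0,−1); the name `decode_ry_curve` is hypothetical.

```python
# Inverse of the assumed RY assignment: step vector -> base.
STEP_TO_BASE = {
    (1, -1, 0): "A",
    (1, 1, 0): "G",
    (1, 0, 1): "T",
    (1, 0, -1): "C",
}


def decode_ry_curve(points):
    """Recover the sequence from RY-curve nodes P_1..P_n (origin excluded)."""
    bases = []
    previous = (0, 0, 0)  # the curve starts at the origin
    for point in points:
        # Difference of consecutive nodes, as in formula (5).
        step = tuple(c - p for c, p in zip(point, previous))
        bases.append(STEP_TO_BASE[step])  # KeyError if not a valid RY-curve
        previous = point
    return "".join(bases)


nodes = [(1, -1, 0), (2, -1, 1), (3, 0, 1), (4, 1, 1), (5, 1, 2),
         (6, 1, 1), (7, 1, 2), (8, 1, 3), (9, 2, 3)]
print(decode_ry_curve(nodes))  # prints ATGGTCTTG
```

Because every step vector maps to exactly one base, the decoding never has to choose between alternatives, which is the practical meaning of "no loss of information".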

The applications of RY-curve, MK-curve and SW-curve

In this section, we will present two applications of RY-curve, MK-curve and SW-curve.

Calculation of the base content of a DNA sequence

Based on Property 3.3 and Property 3.4, we can obtain three equations for each curve, relating the coordinates $(x_n, y_n, z_n)$ of its terminal node to the base contents $A_n$, $G_n$, $T_n$ and $C_n$. For the RY-curve, we have

$$x_n = A_n + G_n + T_n + C_n,\qquad y_n = G_n - A_n,\qquad z_n = T_n - C_n. \tag{7}$$

The MK-curve and the SW-curve yield the analogous systems (8) and (9). Without loss of generality, we select four independent equations from (7), (8) and (9) to form the linear system (10). Since the coefficient matrix of Eq. (10) is nonsingular, there exists one unique solution, which can be obtained recursively (formula (11)). For example, for the complete coding sequence of the human beta-globin gene, we read the terminal coordinates from its RY-curve and MK-curve and substitute these values into formula (11) to obtain the base content. Similarly, using formula (11), we can calculate the base content of the DNA sequences of the eleven species presented in Table 1 (see Table 2).
Table 2

The base contents of the 11 coding sequences of Table 1.

Species     A    G    T    C    Total
Human       17   35   21   19   92
Goat        17   35   17   17   86
Gallus      19   34   15   24   92
Mouse       17   34   23   20   94
Rat         20   33   21   18   92
Chimpanzee  20   41   24   20   105
Bovine      17   35   18   16   86
Gorilla     17   37   20   19   93
Opossum     21   29   22   20   92
Lemur       19   35   23   15   92
Rabbit      17   37   20   16   90
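As a rough illustration of how the base content can be recovered from terminal coordinates, the sketch below solves four equations taken from the RY-curve and the MK-curve. The RY assignment follows the relations of Section 3; the sign choice for the MK-curve (A → (1,−1,0), C → (1,1,0), G → (1,0,−1), T → (1,0,1)) is an assumption made only for this example, and the resulting closed form is not claimed to be the paper's formula (11). The function names are hypothetical.

```python
RY = {"A": (1, -1, 0), "G": (1, 1, 0), "T": (1, 0, 1), "C": (1, 0, -1)}
# ASSUMPTION: one possible sign choice for the MK-curve
# (amino bases A, C in the x-y plane; keto bases G, T in the x-z plane).
MK = {"A": (1, -1, 0), "C": (1, 1, 0), "G": (1, 0, -1), "T": (1, 0, 1)}


def terminal_yz(sequence, vectors):
    """Return (y_n, z_n) of the terminal node for the given assignment."""
    y = sum(vectors[base][1] for base in sequence)
    z = sum(vectors[base][2] for base in sequence)
    return y, z


def base_content(n, y_ry, z_ry, y_mk):
    """Solve n = A+G+T+C, y_ry = G-A, z_ry = T-C, y_mk = C-A for the counts."""
    a, remainder = divmod(n - y_ry - z_ry - 2 * y_mk, 4)
    if remainder:
        raise ValueError("coordinates are not consistent with any sequence")
    c = a + y_mk   # from y_mk = C - A
    g = a + y_ry   # from y_ry = G - A
    t = c + z_ry   # from z_ry = T - C
    return {"A": a, "G": g, "T": t, "C": c}


seq = "ATGGTCTTG"
y_ry, z_ry = terminal_yz(seq, RY)
y_mk, _ = terminal_yz(seq, MK)
print(base_content(len(seq), y_ry, z_ry, y_mk))
# prints {'A': 1, 'G': 3, 'T': 4, 'C': 1}
```

The substitution order (A first, then C, G and T) mirrors the idea of solving the system recursively: once one count is known, each remaining equation yields one more.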

Similarity analysis based on the RY-curve, MK-curve and SW-curve

Comparing similarities among different DNA sequences is one of the essential motivations of graphical representation. To do this, Randic et al. (2000) proposed the E matrix, M/M matrix, L/L matrix and $^{k}L/^{k}L$ matrix, and used matrix eigenvalues as sequence descriptors to make comparisons among DNA sequences. This method has proved useful and has been used by many authors (Randic et al., 2000; Randic et al., 2003; Wang and Zhang, 2006; Li and Wang, 2004). However, when the DNA sequence is very long, these matrices become very large and the calculation becomes very complicated. Yao et al. (2005) used the coordinates of the geometric center of the graph as sequence descriptors for similarity analysis among different DNA sequences. The method is simple as far as calculation is concerned. However, it has a potentially serious drawback: when the two DNA sequences under comparison contain the same proportions of the bases A, G, T and C, they may have the same geometric center even though they are completely different. To overcome this drawback, we develop a new method for comparing similarity between two DNA sequences based on our proposed RY-curve, MK-curve and SW-curve. A twelve-component vector that serves as a sequence descriptor is constructed from the geometric centers of the representing curves.

Construction of the 12-component sequence descriptor

In Section 2, we constructed the RY-curve representing a DNA sequence by restricting the purine bases R = A, G to the x–y plane and the pyrimidine bases Y = T, C to the x–z plane. Under this restriction, there exist four possible ways of assigning the four vectors to the four bases (A, G, T and C) (assignment (12)). Thus, from assignment (12) and Eq. (2), we obtain four different kinds of RY-curve, denoted RY-curve11, RY-curve12, RY-curve13 and RY-curve14. Note that these curves are listed in the same order as the assignments appear in (12). Analogously, for the MK-curve, we can form four kinds of MK-curve, denoted MK-curve21, MK-curve22, MK-curve23 and MK-curve24. For the SW-curve, we can obtain four kinds of SW-curve, denoted SW-curve31, SW-curve32, SW-curve33 and SW-curve34. Therefore, we have a total of twelve 3D curves representing a DNA sequence.

For a given sequence of length n, we have a set of points $(x_i, y_i, z_i)$, i = 1, 2, 3, …, n, from the graphical representation of the sequence. The coordinates of the geometric center of all the points, denoted by $x_0$, $y_0$ and $z_0$, can be calculated as follows (Yao et al., 2005):

$$x_0 = \frac{1}{n}\sum_{i=1}^{n} x_i,\qquad y_0 = \frac{1}{n}\sum_{i=1}^{n} y_i,\qquad z_0 = \frac{1}{n}\sum_{i=1}^{n} z_i. \tag{13}$$

Next, we calculate an index from the geometric center (13) (formula (14)). Applying formula (14) to each of the twelve 3D curves gives a 12-component index vector (15). Here, the first subscript denotes the particular curve (1 for RY, 2 for MK, 3 for SW) and the second subscript denotes which of the four vector assignments is used. The 12-component vector (15) can be used as the sequence descriptor. To ease notation, its components are renumbered consecutively from 1 to 12.
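The twelve curves and their geometric centers can be sketched as follows. This is a minimal illustration under stated assumptions: the four variants per classification are generated by flipping the signs of the y- and z-components, and their order here is arbitrary rather than the order of the paper's assignment (12). The index of formula (14) is not reproduced, so the sketch stops at the geometric centers of Eq. (13). All names are hypothetical.

```python
from itertools import product

# (bases in the x-y plane, bases in the x-z plane) for each classification.
CLASSIFICATIONS = {
    "RY": (("A", "G"), ("T", "C")),
    "MK": (("A", "C"), ("G", "T")),
    "SW": (("A", "T"), ("G", "C")),
}


def sign_variants(xy_pair, xz_pair):
    """Return the four sign choices for one classification (order assumed)."""
    first, second = xy_pair
    third, fourth = xz_pair
    variants = []
    for sign_y, sign_z in product((1, -1), repeat=2):
        variants.append({
            first: (1, -sign_y, 0), second: (1, sign_y, 0),
            third: (1, 0, sign_z), fourth: (1, 0, -sign_z),
        })
    return variants


def geometric_center(sequence, vectors):
    """Mean of the curve nodes P_1..P_n (the origin is not counted)."""
    x = y = z = 0
    total_x = total_y = total_z = 0
    for base in sequence:
        dx, dy, dz = vectors[base]
        x, y, z = x + dx, y + dy, z + dz
        total_x, total_y, total_z = total_x + x, total_y + y, total_z + z
    n = len(sequence)
    return (total_x / n, total_y / n, total_z / n)


def twelve_centers(sequence):
    """Geometric centers keyed by (classification, variant number 1..4)."""
    centers = {}
    for name, (xy_pair, xz_pair) in CLASSIFICATIONS.items():
        for k, vectors in enumerate(sign_variants(xy_pair, xz_pair), start=1):
            centers[(name, k)] = geometric_center(sequence, vectors)
    return centers
```

Calling `twelve_centers("ATGGTCTTG")` returns twelve centers; all share the same x-coordinate, since x depends only on the sequence length, so the discriminating information sits in the y- and z-coordinates.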

Similarity analysis of the coding sequences of beta-globin gene among different species

Comparison based on sequence descriptors is routinely used in similarity analysis. Here, we use the 12-component vector (15) as the index for comparing different DNA sequences. Given the 12-component vectors of species i and j, we introduce two measures to quantify the similarity between the two species: the Euclidean distance $d_{ij}$ between the two vectors and the correlation angle $\theta_{ij}$ between them. The smaller $d_{ij}$ and $\theta_{ij}$ are, the more similar species i and j are. Calculating $d_{ij}$ and $\theta_{ij}$ for all eleven species presented in Table 1, we obtain two similarity matrices, M1 and M2, where $M1 = (d_{ij})_{11 \times 11}$ and $M2 = (\theta_{ij})_{11 \times 11}$. To combine the information from these two matrices, we compute a weighted sum, $M(a) = a\,M1 + (1-a)\,M2$ with $0 \le a \le 1$, as the overall similarity matrix of the eleven species. Setting a = 1/2, we compute the overall similarity matrix M(1/2) for the eleven species and list the result in Table 3.
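The two measures and their weighted combination can be sketched in a few lines. This is an illustration only: it assumes the correlation angle is the angle between the two descriptor vectors, computed as the arccosine of their normalized dot product and expressed in radians, and it uses toy 2-component descriptors rather than the paper's 12-component vectors. The function names are hypothetical.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def correlation_angle(u, v):
    """Angle (radians, assumed unit) between two non-zero descriptor vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def similarity_matrix(descriptors, a=0.5):
    """Weighted sum M(a) = a*M1 + (1-a)*M2 of distance and angle matrices."""
    n = len(descriptors)
    return [
        [a * euclidean_distance(descriptors[i], descriptors[j])
         + (1 - a) * correlation_angle(descriptors[i], descriptors[j])
         for j in range(n)]
        for i in range(n)
    ]


# Toy descriptors (not data from the paper).
toy = [(1.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
matrix = similarity_matrix(toy)
```

The matrix is symmetric with zeros on the diagonal, which is why only the upper triangle needs to be tabulated.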
Table 3

The similarity matrix of the 11 coding sequences of Table 1: M(1/2).

Species     Human      Goat       Gallus     Mouse      Rat        Chimpanzee Bovine     Gorilla    Opossum    Lemur      Rabbit
Human       0          0.0789     1.1475     1.489      1.0999     0.0062     0.0735     0.0424     0.6468     0.5246     0.3757
Goat                   0          1.2189     1.5666     1.1719     0.0736     0.0057     0.0367     0.7203     0.4509     0.3034
Gallus                            0          0.3509     0.0478     1.1529     1.2142     1.1849     0.6076     1.4876     1.5201
Mouse                                        0          0.397      1.4951     1.5617     1.5305     0.8499     1.1405     1.2874
Rat                                                     0          1.1054     1.1671     1.1376     0.5699     1.5337     1.4727
Chimpanzee                                                         0          0.0681     0.0371     0.6528     0.52       0.371
Bovine                                                                        0          0.0314     0.7159     0.4563     0.3086
Gorilla                                                                                  0          0.6872     0.486      0.3379
Opossum                                                                                             0          1.0483     0.9319
Lemur                                                                                                          0          0.1497
Rabbit                                                                                                                    0
It can be observed in Table 3 that the following pairs of species have significantly smaller similarity scores: human–chimpanzee, human–gorilla, gorilla–chimpanzee and bovine–gorilla. Gallus (the only nonmammalian species) is greatly dissimilar to the other species except for rat, because all other entries involving gallus are relatively large (see the fourth row of Table 3). It is also observed that opossum has larger similarity scores with the other species (see the tenth column of Table 3). As presented in Table 3, human–goat, human–bovine, goat–bovine, goat–gorilla, chimpanzee–bovine and chimpanzee–gorilla also have small entries, so they are moderately similar. To compare our results with others, we list the currently published results on the similarity between human and several other species in Table 4. As one can see, there is only limited variation among these different methods; therefore these methods are in overall agreement.
Table 4

The degree of similarity of the coding sequences of several species with the coding sequences of human.

Method                         Chimpanzee Gorilla    Gallus     Opossum    Bovine     Goat
Our work, Table 3              0.0062     0.0424     1.1475     0.6468     0.0735     0.0789
Liao and Ding (2006), Table 5  0.022893   0.025960   0.0106123  0.095765   0.048664   0.052039
Liu et al. (2006), Table 5     0.0145     0.0079     0.2417     0.2815     0.0750     0.1078
Yao et al. (2008), Table 10    0.00449    0.00478    0.02916    0.02999    0.01359    0.01633
Zhang (2009), Table 1          0.9572     0.2633     1.1559     1.1863     0.3606     0.4769
Tang et al. (2010), Table 3    0.0399     0.0441     0.1766     0.1598     0.0799     0.0869
Tang et al. (2010), Table 4    0.0379     0.0423     0.1781     0.1598     0.0796     0.0855

Similarity analysis of cDNA sequences of beta-globin gene among different species

Based on the method proposed in Section 4.2.1, we compare similarities among the cDNA sequences of the beta-globin gene of the eight species listed in Table 5. The results are listed in Table 6.
Table 5

The cDNA sequences of beta-globin gene of 8 species.

Species      Release date  UCSC version  Length (bp)
Human        Feb. 2009     hg19/GRCh37   444
Chimpanzee   Mar. 2006     panTro2       444
Rat          Nov. 2004     rn4           444
Mouse        July 2007     mm9           444
Tetraodon    Mar. 2007     tetNig2       448
Fugu         Oct. 2004     fr2           444
Mouse lemur  Jun. 2003     micMur1       443
Bushbaby     Dec. 2006     otoGar1       444

Table 6

The similarity matrix of the cDNA of 8 species in Table 5: M(1/2).

Species      Human        Chimpanzee   Rat          Mouse        Tetraodon    Fugu         Mouse lemur  Bushbaby
Human        0            0.013149     0.36733      0.44743      0.60224      1.4339       0.37625      0.42344
Chimpanzee                0            0.38038      0.46053      0.61479      1.4465       0.38932      0.43653
Rat                                    0            0.083386     0.27248      1.0744       0.009278     0.057991
Mouse                                               0            0.19741      0.99561      0.074264     0.025603
Tetraodon                                                        0            0.83818      0.26465      0.22101
Fugu                                                                          0            1.066        1.0195
Mouse lemur                                                                                0            0.048804
Bushbaby                                                                                                0
It can be observed in Table 6 that the following pairs of species have significantly smaller similarity scores: human–chimpanzee, rat–mouse and mouse lemur–bushbaby. In fact, the eight species chosen here form four pairs of close evolutionary relatives, namely human–chimpanzee, rat–mouse, mouse lemur–bushbaby and tetraodon–fugu. However, we notice that although the similarity score of tetraodon–fugu is the smallest in the seventh column of Table 6, it is much larger than the entries of the other three pairs of close relatives. This problem remains to be studied further.

Conclusion

In this paper, we propose three graphical representations, namely the RY-curve, MK-curve and SW-curve, to represent a DNA sequence in 3D space. We prove that the 3D curves are strictly non-degenerate and that there is no loss of information in transferring the DNA sequence to the proposed curves. Compared with other graphical representations, the main advantage of our method is that the 2D projections of the RY-curve, MK-curve and SW-curve onto the x–y plane and the x–z plane have clear biological implication. For example, the 2D projection of the RY-curve onto the x–y plane denotes the changing trend of the content of A and G (see Fig. 4). The three components of the terminal node of these 3D curves are algebraically related to the content of the bases A, G, T and C (see Property 3.3 and Property 3.4). Therefore, more information is retained by our method than by other available methods. As an application of the graphical representation, we derive a simple formula to recover the content of the four kinds of bases (A, G, C and T) in a DNA sequence from the proposed curves. The 12-component sequence descriptors we have constructed enable us to conduct similarity analysis among the coding sequences of the first exon of the beta-globin gene of 11 species. Our results are in overall agreement with the results reported in the literature (Zhang, 2009; Yao et al., 2008; Tang et al., 2010; Liao and Ding, 2006; Liu et al., 2006) (see Table 4). We also validate the method on the similarities of cDNA sequences of the beta-globin gene of eight species. The computation involved in implementing the proposed methods is fairly straightforward.
References (10 of 40 shown)

1.  On a 3-D representation of DNA primary sequences.

Authors:  Chun Li; Jun Wang
Journal:  Comb Chem High Throughput Screen       Date:  2004-02       Impact factor: 1.339

2.  Do "antisense proteins" exist?

Authors:  K C Chou; C T Zhang; D W Elrod
Journal:  J Protein Chem       Date:  1996-01

3.  Graphic analysis of codon usage strategy in 1490 human proteins.

Authors:  C T Zhang; K C Chou
Journal:  J Protein Chem       Date:  1993-06

4.  A new schematic method in enzyme kinetics.

Authors:  K C Chou
Journal:  Eur J Biochem       Date:  1980-12

5.  An extension of Chou's graphic rules for deriving enzyme kinetic equations to systems involving parallel reaction pathways.

Authors:  G P Zhou; M H Deng
Journal:  Biochem J       Date:  1984-08-15       Impact factor: 3.857

6.  A probability cellular automaton model for hepatitis B viral infections.

Authors:  Xuan Xiao; Shi-Huang Shao; Kuo-Chen Chou
Journal:  Biochem Biophys Res Commun       Date:  2006-02-08       Impact factor: 3.575

7.  GPCR-CA: A cellular automaton image approach for predicting G-protein-coupled receptor functional classes.

Authors:  Xuan Xiao; Pu Wang; Kuo-Chen Chou
Journal:  J Comput Chem       Date:  2009-07-15       Impact factor: 3.376

8.  Using cellular automata images and pseudo amino acid composition to predict protein subcellular location.

Authors:  X Xiao; S Shao; Y Ding; Z Huang; K-C Chou
Journal:  Amino Acids       Date:  2005-07-28       Impact factor: 3.520

9.  A novel fingerprint map for detecting SARS-CoV.

Authors:  Lei Gao; Yong-Sheng Ding; Hua Dai; Shi-Huang Shao; Zhen-De Huang; Kuo-Chen Chou
Journal:  J Pharm Biomed Anal       Date:  2005-11-14       Impact factor: 3.935

10.  The community structure of human cellular signaling network.

Authors:  Yuanbo Diao; Menglong Li; Zinan Feng; Jiajian Yin; Yi Pan
Journal:  J Theor Biol       Date:  2007-04-12       Impact factor: 2.691
