Guosen Xie1, Zhongxi Mo. 1. School of Mathematics and Statistics, Wuhan University, Wuhan 430072, China.
Abstract
In this article, we introduce three 3D graphical representations of DNA primary sequences, which we call RY-curve, MK-curve and SW-curve, based on three classifications of the DNA bases. The advantages of our representations are that (i) these 3D curves are strictly non-degenerate and there is no loss of information when transferring a DNA sequence to its mathematical representation and (ii) the coordinates of every node on these 3D curves have clear biological implication. Two applications of these 3D curves are presented: (a) a simple formula is derived to calculate the content of the four bases (A, G, C and T) from the coordinates of nodes on the curves; and (b) a 12-component characteristic vector is constructed to compare similarity among DNA sequences from different species based on the geometrical centers of the 3D curves. As examples, we examine similarity among the coding sequences of the first exon of beta-globin gene from eleven species and validate similarity of cDNA sequences of beta-globin gene from eight species.
Advances in DNA sequencing technology and DNA databases have greatly facilitated biological research involving DNA sequences. However, it has been acknowledged that the information contained in DNA sequences is difficult for humans to comprehend without careful extraction and processing. Many methods have been proposed to characterize DNA sequences, with special efforts given to representing the sequence graphically. Using graphical approaches to study biological problems can provide an intuitive picture or useful insights for analyzing complicated relations in these systems, as demonstrated by many previous studies on a series of important biological topics, such as analysis of codon usage (Chou and Zhang, 1992, Zhang and Chou, 1993, Zhang and Chou, 1994), base frequencies in the anti-sense strands (Chou et al., 1996), analysis of DNA and protein sequences (Qi et al., 2007, Wu et al., 2010, Yu et al., 2009), enzyme-catalyzed reactions (Andraos, 2008, Chou, 1980, Chou, 1981, Chou, 1989, Chou et al., 1979, Chou and Forsen, 1980, Chou and Liu, 1981, Lin and Neet, 1990, Zhou and Deng, 1984), protein folding kinetics and folding rates (Chou, 1990, Chou and Shen, 2009, Shen et al., 2009), inhibition kinetics of processive nucleic acid polymerases and nucleases (Chou et al., 1994), and drug metabolism systems (Chou, 2010). Moreover, graphical methods have also been introduced to deal with other biological and medical problems (Diao et al., 2007, Gonzalez-Diaz et al., 2009, Munteanu et al., 2009).
Recently, images of cellular automata (Wolfram, 1984, Wolfram, 2002) have also been used to represent biological sequences (Xiao et al., 2005a) for predicting protein structural classes (Xiao et al., 2008) and subcellular location (Xiao et al., 2006b), identifying G-protein-coupled receptor functional classes (Xiao et al., 2009), investigating HBV gene missense mutations (Xiao et al., 2005b) and HBV viral infections (Xiao et al., 2006a), as well as analyzing SARS-CoV (Gao et al., 2006, Wang et al., 2005).
Graphical representation of DNA sequences was first proposed by Hamori and Ruskin (1983). Gates (1986), Nandy (1994) and Leong and Morgenthaler (1995) developed 2D graphical representations of DNA sequences. These methods are straightforward but are accompanied by some loss of information, because the curve representing the DNA overlaps and crosses itself and because circuits in the curve generate degeneracy. Randic et al. (2003) developed a novel 2D representation method in which there is no loss of information in transferring a DNA sequence to its mathematical representation. More recently, several other 2D representations have been proposed (Wang and Zhang, 2006; Zhang, 2009; Yao et al., 2008, Zhao et al., 2010).
As for 3D graphical representations, Hamori and Ruskin (1983) developed the H-curve, which can uniquely represent a DNA sequence. Based on the classifications of DNA bases, Zhang et al. (Zhang and Zhang, 1994, Zhang, 1997, Zhang et al., 2003) created the Z-curve to represent DNA sequences, in which the four bases (A, G, T and C) are represented by the four vertices of a regular tetrahedron, as A(1,1,1), T(–1,–1,1), C(–1,1,–1) and G(1,–1,–1). The Z-curve is a 3D graphical representation and it has clear biological implication. However, as pointed out by Tang et al. (2010), the Z-curve representation has a defect: it might cause a loop in the resulting spatial curve if the frequencies of the four bases present in the sequence are the same. Randic et al. (2000) presented another 3D graphical representation method, but the limitation in the form of crossing and overlapping of the spatial curve representing a DNA sequence still remains. Recently, several further 3D representations were developed (Li and Wang, 2004, Liao and Wang, 2004; Yu et al., 2009) to overcome the problem of degeneracy in graphical representation. These methods, however, do not seem to possess apparent biological meaning.
In this article, we introduce three novel 3D graphical representations of DNA primary sequences, namely, the RY-curve, the MK-curve and the SW-curve. These curves are derived from three classifications of the four DNA bases A, G, T and C, respectively. It can be proved that the proposed representations are strictly non-degenerate and therefore avoid potential information loss when transferring a DNA sequence to its representation. Moreover, the coordinates of every node on these 3D curves have clear biological implication. In Section 4, we present two applications developed on the basis of the proposed representations.
Construction of RY-curve, MK-curve and SW-curve
The four DNA bases (A, G, T and C) can be classified in the following three ways according to their chemical properties:
Chemical structure of the bases: R (purines) = A, G / Y (pyrimidines) = T, C;
Functional groups of the bases: M (amino) = A, C / K (keto) = G, T;
Strength of the hydrogen bonds between paired bases: S (strong) = G, C / W (weak) = A, T.
First consider the R/Y classification. In a 3D space, a point or a vector has three components. We assign the following vectors to the four DNA bases:
$$A \to (1,-1,0),\quad G \to (1,1,0),\quad T \to (1,0,1),\quad C \to (1,0,-1). \tag{1}$$
Notice that we restrict the two vectors representing the purine bases R = A, G to the x–y plane and the two vectors representing the pyrimidine bases Y = T, C to the x–z plane (see Fig. 1).
Fig. 1
The vectors representing the four bases according to the R/Y classification. Purine bases R=A,G are limited in x–y plane and pyrimidine bases Y=T,C are limited in x–z plane.
Given a DNA sequence with n bases, $S = s_1 s_2 \cdots s_n$, we look at one base at a time. For the i-th base (i = 1, 2, …, n), a corresponding point $P_i(x_i, y_i, z_i)$ can be determined in the 3D space as follows:
$$x_i = \sum_{k=1}^{i} a_k^{x},\qquad y_i = \sum_{k=1}^{i} a_k^{y},\qquad z_i = \sum_{k=1}^{i} a_k^{z}, \tag{2}$$
where $a_k^{x}$, $a_k^{y}$ and $a_k^{z}$ represent the x-component, y-component and z-component of the vector corresponding to $s_k$, respectively. All n bases of the DNA sequence are examined consecutively, and in the end we obtain n points $P_1, P_2, \dots, P_n$ in the 3D space. Then, starting from the origin (0, 0, 0) and connecting adjacent points, we obtain a 3D curve, called the RY-curve.
As an example, suppose we have the sequence S = ATGGTCTTG. Applying the proposed method, we get ten points (the origin plus one point for each of the nine bases):
(0,0,0), (1,−1,0), (2,−1,1), (3,0,1), (4,1,1), (5,1,2), (6,1,1), (7,1,2), (8,1,3), (9,2,3).
Connecting these points sequentially, we obtain the RY-curve (see Fig. 2) for this particular DNA sequence.
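The construction above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the dictionary `RY_VECTORS` assumes the assignment A → (1, −1, 0), G → (1, 1, 0), T → (1, 0, 1), C → (1, 0, −1), and the function name `build_curve` is hypothetical.

```python
# Assumed R/Y assignment: purines in the x-y plane, pyrimidines in the x-z plane.
RY_VECTORS = {
    "A": (1, -1, 0),
    "G": (1, 1, 0),
    "T": (1, 0, 1),
    "C": (1, 0, -1),
}


def build_curve(sequence, vectors):
    """Return the nodes P_0, P_1, ..., P_n as cumulative sums of base vectors."""
    x = y = z = 0
    points = [(0, 0, 0)]  # the curve starts at the origin
    for base in sequence.upper():
        dx, dy, dz = vectors[base]
        x, y, z = x + dx, y + dy, z + dz
        points.append((x, y, z))
    return points


ry_points = build_curve("ATGGTCTTG", RY_VECTORS)
for point in ry_points:
    print(point)
```

Under this assumed assignment the last node is (9, 2, 3): the x-coordinate equals the sequence length, while y and z count G minus A and T minus C.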
Fig. 2
The RY-curve representation of the sequence ATGGTCTTG.
Now we consider the M/K classification and the S/W classification of the four bases.
For the M/K classification, the two vectors in the x–y plane, (1, ±1, 0), are assigned to the amino bases M = A, C, and the two vectors in the x–z plane, (1, 0, ±1), are assigned to the keto bases K = G, T (assignment (3)). A different way of representing the DNA sequence graphically is thus established. We call the 3D curve generated under this definition the MK-curve.
Similarly, for the S/W classification, the weak hydrogen-bond bases W = A, T are assigned the two vectors in the x–y plane and the strong hydrogen-bond bases S = G, C are assigned the two vectors in the x–z plane (assignment (4)). We then obtain the third 3D graphical representation of the DNA sequence. 3D curves generated under this definition are called SW-curves.
As an example, Fig. 3 shows the RY-curve, MK-curve and SW-curve of exon 1 of the human beta-globin gene listed in Table 1.
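The MK-curve and SW-curve can be generated with the same cumulative-sum procedure by swapping the vector table. The sketch below is illustrative only. The text fixes which bases lie in which plane, but the signs chosen in `MK_VECTORS` and `SW_VECTORS` are assumptions made here for concreteness, and `build_curve` is a hypothetical helper name.

```python
# Plane membership follows the text; the signs within each plane are assumed.
MK_VECTORS = {"A": (1, -1, 0), "C": (1, 1, 0), "G": (1, 0, 1), "T": (1, 0, -1)}
SW_VECTORS = {"A": (1, -1, 0), "T": (1, 1, 0), "G": (1, 0, 1), "C": (1, 0, -1)}


def build_curve(sequence, vectors):
    """Return the nodes P_0, ..., P_n as cumulative sums of base vectors."""
    x = y = z = 0
    points = [(0, 0, 0)]
    for base in sequence.upper():
        dx, dy, dz = vectors[base]
        x, y, z = x + dx, y + dy, z + dz
        points.append((x, y, z))
    return points


mk_points = build_curve("ATGGTCTTG", MK_VECTORS)
sw_points = build_curve("ATGGTCTTG", SW_VECTORS)
print("MK terminal node:", mk_points[-1])
print("SW terminal node:", sw_points[-1])
```

With these assumed signs, the terminal y and z coordinates of the MK-curve equal C − A and G − T, and those of the SW-curve equal T − A and G − C.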
Fig. 3
The RY-curve, MK-curve and SW-curve of the coding sequences of the first exon of beta-globin gene of human. (A) The RY-curve of the coding sequences of the first exon of beta-globin gene of human. (B) The MK-curve of the coding sequences of the first exon of beta-globin gene of human. (C) The SW-curve of the coding sequences of the first exon of beta-globin gene of human.
Table 1
The coding sequences of the first exon of beta-globin gene of 11 different species.
Properties of RY-curve, MK-curve and SW-curve
In this section, we prove some properties of the RY-curve, MK-curve and SW-curve. We use the notations A, G, T and C to denote the content of the bases A, G, T and C, respectively, in a DNA sequence $S = s_1 s_2 \cdots s_n$, $s_i \in \{A, T, G, C\}$.
Property 3.1. There is no circuit and no degeneracy in the RY-curve, MK-curve and SW-curve.
Proof. We prove this property by contradiction. First consider the RY-curve. Suppose that there are one or more circuits in an RY-curve. Then there exists at least one point in the 3D space at which the curve crosses itself. That is, two points on the curve, say $P_i$ and $P_j$, $i \neq j$, have exactly the same coordinates $(x_i, y_i, z_i) = (x_j, y_j, z_j)$. So we must have $x_i = x_j$. According to assignment (1) and Eq. (2), we have $x_i = i$ and $x_j = j$. This implies i = j, which contradicts the supposition that i ≠ j. Therefore, there is no circuit and no degeneracy in the RY-curve. Similarly, we can show that there is also no circuit and no degeneracy in the MK-curve and SW-curve. □
Property 3.2. There exists a one-to-one correspondence between a DNA sequence and its RY-curve (MK-curve or SW-curve), and no information is lost.
Proof. First consider the RY-curve. From the previous proof, we know that, for a given DNA sequence $S = s_1 s_2 \cdots s_n$, there exists one unique RY-curve. Conversely, suppose that the RY-curve of a DNA sequence is given; it then follows immediately that the coordinates of all n nodes on the RY-curve, $(x_i, y_i, z_i)$, i = 1, 2, …, n, are given. Let $(x_0, y_0, z_0) = (0, 0, 0)$. According to Eq. (2), the base $s_i$ corresponding to the node $P_i(x_i, y_i, z_i)$ on the RY-curve can be calculated using the following formula:
$$a_i = (x_i, y_i, z_i) - (x_{i-1}, y_{i-1}, z_{i-1}). \tag{5}$$
Formula (5) consists of the following set of equations:
$$x_i - x_{i-1} = a_i^{x},\qquad y_i - y_{i-1} = a_i^{y},\qquad z_i - z_{i-1} = a_i^{z},\qquad i = 1, 2, \dots, n, \tag{6}$$
where $a_i \in \{(1,-1,0), (1,1,0), (1,0,1), (1,0,-1)\}$, i = 1, 2, …, n, is known. Note that $(x_0, y_0, z_0) = (0, 0, 0)$. Regarding $(x_1, y_1, z_1), \dots, (x_n, y_n, z_n)$ as unknowns, we obtain the coefficient matrix of Eqs. (6), for each of the three coordinates, to be
$$A = \begin{pmatrix} 1 & 0 & \cdots & 0 & 0 \\ -1 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & -1 & 1 \end{pmatrix}_{n \times n}.$$
The determinant |A| = 1 ≠ 0; therefore, for the given RY-curve, Eqs. (6) have a unique solution. This implies that one RY-curve uniquely determines one corresponding DNA sequence. Hence, the correspondence between DNA sequences and RY-curves is one-to-one and there is no loss of information. Similarly, we can prove that Property 3.2 holds for the MK-curve and SW-curve as well. □
Property 3.3. The x-component of the terminal node $P_n(x_n, y_n, z_n)$ of the RY-curve (MK-curve or SW-curve), $x_n$, is just the length of the DNA sequence $S = s_1 s_2 \cdots s_n$; that is,
$$x_n = n = A + G + T + C.$$
Proof. The proof follows immediately from assignment (1) and Eq. (2). □
Property 3.4.
(i) For the RY-curve, its projections (2D curves) onto the x–y plane and the x–z plane denote the distributions of the purine bases (A, G) and the pyrimidine bases (T, C) along the sequence $S = s_1 s_2 \cdots s_n$, respectively, and we have $y_n = G - A$, $z_n = T - C$.
(ii) For the MK-curve, its projections (2D curves) onto the x–y plane and the x–z plane denote the distributions of the bases of the amino group (A, C) and the keto group (G, T) along the sequence, respectively, and we have $y_n = \pm(A - C)$, $z_n = \pm(G - T)$, with the signs fixed by assignment (3).
(iii) For the SW-curve, its projections (2D curves) onto the x–y plane and the x–z plane denote the distributions of the bases with weak H-bonds (A, T) and strong H-bonds (G, C) along the sequence, respectively, and we have $y_n = \pm(A - T)$, $z_n = \pm(G - C)$, with the signs fixed by assignment (4).
Proof. First we prove (i). From assignment (1), we know that the vectors representing the bases G and A are symmetric about the x-axis in the x–y plane, and the vectors representing the bases T and C are symmetric about the x-axis in the x–z plane. So, according to Eq. (2), we have
$$y_k = G_k - A_k,\qquad z_k = T_k - C_k,\qquad k = 1, 2, \dots, n,$$
where $A_k$, $G_k$, $T_k$ and $C_k$ denote the numbers of the respective bases among the first k bases of the sequence. The projection of the RY-curve onto the x–y plane is a 2D curve with nodes $\{(x_k, y_k),\ k = 1, 2, \dots, n\}$, and the projection of the RY-curve onto the x–z plane is a 2D curve with nodes $\{(x_k, z_k),\ k = 1, 2, \dots, n\}$. So, the projections of the RY-curve onto the x–y plane and the x–z plane display the distributions of the purine bases (A, G) and the pyrimidine bases (T, C) along the sequence, respectively. Similarly, we can prove (ii) and (iii). □
Fig. 4 shows the projections of the RY-curve of exon 1 of the human beta-globin gene onto the x–y plane and the x–z plane.
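Properties 3.1 and 3.2 can be checked numerically. The following Python sketch is only an illustration under the assumed R/Y assignment A → (1, −1, 0), G → (1, 1, 0), T → (1, 0, 1), C → (1, 0, −1). The helper names `decode_curve` and `is_non_degenerate` are hypothetical. Decoding takes successive differences of nodes, and the non-degeneracy check relies on the x-coordinate increasing strictly.

```python
RY_VECTORS = {
    "A": (1, -1, 0),
    "G": (1, 1, 0),
    "T": (1, 0, 1),
    "C": (1, 0, -1),
}


def decode_curve(points, vectors):
    """Recover the sequence from the nodes P_0..P_n via successive differences."""
    inverse = {vec: base for base, vec in vectors.items()}
    bases = []
    for prev, cur in zip(points, points[1:]):
        step = tuple(c - p for p, c in zip(prev, cur))
        bases.append(inverse[step])  # KeyError if the step is not a base vector
    return "".join(bases)


def is_non_degenerate(points):
    """True if x strictly increases, so no two nodes can coincide."""
    xs = [p[0] for p in points]
    return all(b > a for a, b in zip(xs, xs[1:]))


# Nodes of the RY-curve of ATGGTCTTG, origin included.
points = [(0, 0, 0), (1, -1, 0), (2, -1, 1), (3, 0, 1), (4, 1, 1),
          (5, 1, 2), (6, 1, 1), (7, 1, 2), (8, 1, 3), (9, 2, 3)]
decoded = decode_curve(points, RY_VECTORS)
print(decoded, is_non_degenerate(points))
```

Because each step vector identifies exactly one base, the decoding is unambiguous, which is the content of the one-to-one correspondence.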
Fig. 4
The 2D projection of the RY-curve of the first exon of beta-globin gene of human in x–y plane and x–z plane. (A) The 2D projection of the RY-curve of the first exon of beta-globin gene of human in x–y plane, which denotes the distribution of purine bases (A,G) along the coding sequence. (B) The 2D projection of the RY-curve of the first exon of beta-globin gene of human in x–z plane, which denotes the distribution of pyrimidine bases (T,C) along the coding sequence.
From Fig. 4, we can see that G > A and T > C, and the changing trend of the content of bases A, G and bases T, C along the sequence is also directly observable.
Property 3.5. Let $(x_n, y_n, z_n)$ denote the coordinates of the terminal point of an RY-curve. Then the following relationships hold:
(i) if $y_n > 0$, $z_n > 0$, then $T + G > A + C$;
(ii) if $y_n < 0$, $z_n < 0$, then $T + G < A + C$;
(iii) if $y_n > 0$, $z_n < 0$, then $G + C > A + T$;
(iv) if $y_n < 0$, $z_n > 0$, then $G + C < A + T$;
(v) if $y_n = 0$, $z_n = 0$, then $T + G = A + C$ and $T = C$, $G = A$.
Proof. The above results follow directly from assignment (1) and Eq. (2). □
For the MK-curve and SW-curve, there are analogous properties as well.
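The case analysis of the terminal point can be written out as a small lookup. The sketch below is a hedged illustration: it assumes $y_n = G - A$ and $z_n = T - C$ for the RY-curve, the function name `implied_relation` is hypothetical, and the input values are the human exon 1 counts from Table 2.

```python
def implied_relation(y_n, z_n):
    """Map the signs of the terminal y and z coordinates to the stated relation."""
    if y_n > 0 and z_n > 0:
        return "T + G > A + C"
    if y_n < 0 and z_n < 0:
        return "T + G < A + C"
    if y_n > 0 and z_n < 0:
        return "G + C > A + T"
    if y_n < 0 and z_n > 0:
        return "G + C < A + T"
    if y_n == 0 and z_n == 0:
        return "T = C and G = A"
    return None  # exactly one coordinate is zero: not covered by the list above


# Human exon 1 counts taken from Table 2.
a, g, t, c = 17, 35, 21, 19
relation = implied_relation(g - a, t - c)
print(relation, (t + g) > (a + c))
```

For these counts the terminal point has $y_n = 18$ and $z_n = 2$, so the first case applies.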
The applications of RY-curve, MK-curve and SW-curve
In this section, we will present two applications of RY-curve, MK-curve and SW-curve.
Calculation of the base content of a DNA sequence
Based on Property 3.3 and Property 3.4, we can obtain three equations for each curve. For the RY-curve, we have
$$x_n = A + G + T + C,\qquad y_n = G - A,\qquad z_n = T - C. \tag{7}$$
For the MK-curve and the SW-curve, the analogous sets of equations (8) and (9) hold: $x_n$ is again the sequence length, and $y_n$ and $z_n$ are the signed differences between the contents of the two bases assigned to the x–y plane and to the x–z plane, respectively.
Without loss of generality, we select four independent equations from (7), (8) and (9) to form the system (10). Since the coefficient matrix of Eq. (10) is nonsingular, there exists one unique solution, which can be obtained recursively (formula (11)).
For example, for the complete coding sequence of the human beta-globin gene, the terminal coordinates of its RY-curve and MK-curve are substituted into formula (11) to obtain the base content. Similarly, using formula (11), we can calculate the base content of the DNA sequences of the eleven species presented in Table 1 (see Table 2).
Table 2
The base contents of the 11 coding sequences of Table 1.

| Species | A | G | T | C | Total |
|---|---|---|---|---|---|
| Human | 17 | 35 | 21 | 19 | 92 |
| Goat | 17 | 35 | 17 | 17 | 86 |
| Gallus | 19 | 34 | 15 | 24 | 92 |
| Mouse | 17 | 34 | 23 | 20 | 94 |
| Rat | 20 | 33 | 21 | 18 | 92 |
| Chimpanzee | 20 | 41 | 24 | 20 | 105 |
| Bovine | 17 | 35 | 18 | 16 | 86 |
| Gorilla | 17 | 37 | 20 | 19 | 93 |
| Opossum | 21 | 29 | 22 | 20 | 92 |
| Lemur | 19 | 35 | 23 | 15 | 92 |
| Rabbit | 17 | 37 | 20 | 16 | 90 |
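One way to recover the four base contents from terminal coordinates is sketched below. It is an assumption-laden illustration and not necessarily the system the authors chose: it uses the three RY equations ($x_n = A+G+T+C$, $y_n = G-A$, $z_n = T-C$) together with an MK equation of the assumed form $y_n^{MK} = C - A$. The function name `base_content` is hypothetical, and the input values are derived from the human row of Table 2.

```python
def base_content(x_n, y_ry, z_ry, y_mk):
    """Solve for (A, G, T, C) from four terminal coordinates.

    Assumed equations: x_n = A+G+T+C, y_ry = G-A, z_ry = T-C, y_mk = C-A.
    Substituting G, C and T in terms of A gives 4A = x_n - y_ry - z_ry - 2*y_mk.
    """
    numerator = x_n - y_ry - z_ry - 2 * y_mk
    a, remainder = divmod(numerator, 4)
    if remainder != 0 or a < 0:
        raise ValueError("coordinates are inconsistent with the assumed equations")
    g = a + y_ry       # from y_ry = G - A
    c = a + y_mk       # from y_mk = C - A
    t = c + z_ry       # from z_ry = T - C
    return {"A": a, "G": g, "T": t, "C": c}


# Terminal coordinates implied by the human row of Table 2 (A=17, G=35, T=21, C=19).
human = base_content(x_n=92, y_ry=18, z_ry=2, y_mk=2)
print(human)
```

The solution proceeds in the recursive fashion described in the text: one content is obtained first, and the other three follow by simple addition.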
Similarity analysis based on the RY-curve, MK-curve and SW-curve
Comparing similarities among different DNA sequences is one of the essential motivations of graphical representation. For this purpose, Randic et al. (2000) proposed the E matrix, M/M matrix, L/L matrix and $^{k}L/^{k}L$ matrix, and used matrix eigenvalues as sequence descriptors to compare DNA sequences. This method has proved useful and has been adopted by many authors (Randic et al., 2000, Randic et al., 2003, Wang and Zhang, 2006, Li and Wang, 2004). However, when a DNA sequence is very long, these matrices become very large and the calculation becomes very complicated. Yao et al. (2005) used the coordinates of the geometric center of the graph as sequence descriptors for similarity analysis among different DNA sequences. That method is computationally simple, but it has a potentially serious drawback: when two DNA sequences under comparison contain the same proportions of the bases A, G, T and C, they may have the same geometric center even though they are completely different. To overcome this drawback, we develop a new method for comparing the similarity between two DNA sequences based on our proposed RY-curve, MK-curve and SW-curve. A twelve-component vector that serves as a sequence descriptor is constructed from the geometrical centers of the representing curves.
Construction of the 12-component sequence descriptor
In Section 2, we constructed the RY-curve representing a DNA sequence by restricting the purine bases R = A, G to the x–y plane and the pyrimidine bases Y = T, C to the x–z plane. Under this restriction, there exist four possible ways of assigning the four vectors to the four bases (A, G, T and C), obtained by optionally exchanging the vectors of A and G and, independently, the vectors of T and C (assignment (12)). Thus, from assignment (12) and Eq. (2), we have four different kinds of RY-curve, denoted RY-curve11, RY-curve12, RY-curve13 and RY-curve14, listed in the same order as the assignments appear in (12).
Analogously, for the MK-curve, we can form four kinds of MK-curve, denoted MK-curve21, MK-curve22, MK-curve23 and MK-curve24. For the SW-curve, we obtain four kinds of SW-curve, denoted SW-curve31, SW-curve32, SW-curve33 and SW-curve34. Therefore, we have a total of twelve 3D curves representing a DNA sequence.
For a given sequence of length n, we have a set of points $(x_i, y_i, z_i)$, i = 1, 2, 3, …, n, from the graphical representation of the sequence. The coordinates of the geometrical center of all the points, denoted by $x_0$, $y_0$ and $z_0$, can be calculated as follows (Yao et al., 2005):
$$x_0 = \frac{1}{n}\sum_{i=1}^{n} x_i,\qquad y_0 = \frac{1}{n}\sum_{i=1}^{n} y_i,\qquad z_0 = \frac{1}{n}\sum_{i=1}^{n} z_i. \tag{13}$$
Next, we calculate an index from (13) (formula (14)). Using formula (14), we calculate an index vector based on all twelve 3D curves, denoted by
$$(v_{11}, v_{12}, v_{13}, v_{14}, v_{21}, v_{22}, v_{23}, v_{24}, v_{31}, v_{32}, v_{33}, v_{34}). \tag{15}$$
Here, the first subscript denotes the particular curve (RY, MK, SW) and the second subscript denotes which of the four vector assignments is used. The 12-component vector (15) can be used as the sequence descriptor. To ease notation, we rewrite the 12-component vector (15) as $(v_1, v_2, \dots, v_{12})$ (16).
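The twelve curves and their geometric centers can be enumerated as in the following Python sketch. It is illustrative only. The order in which the four sign variants are generated is an arbitrary choice made here, and the names `sign_variants`, `geometric_center` and `all_centers` are hypothetical. The further index that is computed from each center is not reproduced, so the sketch stops at the centers.

```python
from itertools import product

# Bases restricted to the x-y plane and to the x-z plane for each classification.
CLASSIFICATIONS = {
    "RY": ("AG", "TC"),
    "MK": ("AC", "GT"),
    "SW": ("AT", "GC"),
}


def sign_variants(xy_bases, xz_bases):
    """Yield the four vector assignments of one classification (arbitrary order)."""
    for sy, sz in product((1, -1), repeat=2):
        yield {
            xy_bases[0]: (1, -sy, 0),
            xy_bases[1]: (1, sy, 0),
            xz_bases[0]: (1, 0, sz),
            xz_bases[1]: (1, 0, -sz),
        }


def geometric_center(sequence, vectors):
    """Mean of the nodes P_1..P_n (the origin is not included)."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    x = y = z = 0
    sum_x = sum_y = sum_z = 0
    for base in sequence.upper():
        dx, dy, dz = vectors[base]
        x, y, z = x + dx, y + dy, z + dz
        sum_x, sum_y, sum_z = sum_x + x, sum_y + y, sum_z + z
    n = len(sequence)
    return (sum_x / n, sum_y / n, sum_z / n)


def all_centers(sequence):
    """Geometric centers of the twelve curves, keyed by an ad hoc label."""
    centers = {}
    for name, (xy_bases, xz_bases) in CLASSIFICATIONS.items():
        for k, vectors in enumerate(sign_variants(xy_bases, xz_bases), start=1):
            centers[f"{name}-{k}"] = geometric_center(sequence, vectors)
    return centers


centers = all_centers("ATGGTCTTG")
for label, center in centers.items():
    print(label, center)
```

Since every base vector has x-component 1, all twelve centers share the same $x_0 = (n+1)/2$; the curves differ only in $y_0$ and $z_0$.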
Similarity analysis of the coding sequences of beta-globin gene among different species
Comparison based on sequence descriptors is a method that has been routinely used in similarity analysis. Here, we use the 12-component vector (15) as the index for comparing different DNA sequences.
Suppose that for species i and j, their 12-component vectors are $V_i = (v_{i,1}, v_{i,2}, \dots, v_{i,12})$ and $V_j = (v_{j,1}, v_{j,2}, \dots, v_{j,12})$. We introduce two measures to quantify the similarity between the two species, the Euclidean distance $d_{ij}$ and the correlation angle $\theta_{ij}$:
$$d_{ij} = \sqrt{\sum_{k=1}^{12} (v_{i,k} - v_{j,k})^2},\qquad \theta_{ij} = \arccos\frac{\sum_{k=1}^{12} v_{i,k}\, v_{j,k}}{\sqrt{\sum_{k=1}^{12} v_{i,k}^2}\,\sqrt{\sum_{k=1}^{12} v_{j,k}^2}}.$$
The smaller $d_{ij}$ and $\theta_{ij}$ are, the more similar species i and j are.
Calculating $d_{ij}$ and $\theta_{ij}$ for all eleven species presented in Table 1, we obtain two similarity matrices, $M_1 = (d_{ij})_{11 \times 11}$ and $M_2 = (\theta_{ij})_{11 \times 11}$. To combine the information from these two matrices, we compute a weighted sum, $M(a) = aM_1 + (1-a)M_2$ $(0 \le a \le 1)$, as the overall similarity matrix of the eleven species. Setting a = 1/2, we compute the overall similarity matrix M(1/2) for the eleven species and list the result in Table 3.
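The two measures and their weighted combination can be sketched as follows. This is a minimal illustration using only the Python standard library. The function names are hypothetical, and the three short descriptor vectors are made-up toy values, not data from this study.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def correlation_angle(u, v):
    """Angle (in radians) between two descriptor vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / norm))  # guard against rounding error
    return math.acos(cosine)


def combined_similarity(descriptors, a=0.5):
    """Return M(a) = a*M1 + (1-a)*M2 as a nested list."""
    return [
        [a * euclidean_distance(u, v) + (1 - a) * correlation_angle(u, v)
         for v in descriptors]
        for u in descriptors
    ]


# Toy descriptors (hypothetical values, for illustration only).
toy = [(1.0, 2.0, 3.0), (1.1, 2.0, 2.9), (3.0, 1.0, 0.5)]
matrix = combined_similarity(toy, a=0.5)
for row in matrix:
    print([round(value, 4) for value in row])
```

The resulting matrix is symmetric with a zero diagonal, so only its upper triangle needs to be tabulated.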
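Table 3
The similarity matrix of the 11 coding sequences of Table 1: M(1/2).

| Species | Human | Goat | Gallus | Mouse | Rat | Chimpanzee | Bovine | Gorilla | Opossum | Lemur | Rabbit |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Human | 0 | 0.0789 | 1.1475 | 1.489 | 1.0999 | 0.0062 | 0.0735 | 0.0424 | 0.6468 | 0.5246 | 0.3757 |
| Goat | | 0 | 1.2189 | 1.5666 | 1.1719 | 0.0736 | 0.0057 | 0.0367 | 0.7203 | 0.4509 | 0.3034 |
| Gallus | | | 0 | 0.3509 | 0.0478 | 1.1529 | 1.2142 | 1.1849 | 0.6076 | 1.4876 | 1.5201 |
| Mouse | | | | 0 | 0.397 | 1.4951 | 1.5617 | 1.5305 | 0.8499 | 1.1405 | 1.2874 |
| Rat | | | | | 0 | 1.1054 | 1.1671 | 1.1376 | 0.5699 | 1.5337 | 1.4727 |
| Chimpanzee | | | | | | 0 | 0.0681 | 0.0371 | 0.6528 | 0.52 | 0.371 |
| Bovine | | | | | | | 0 | 0.0314 | 0.7159 | 0.4563 | 0.3086 |
| Gorilla | | | | | | | | 0 | 0.6872 | 0.486 | 0.3379 |
| Opossum | | | | | | | | | 0 | 1.0483 | 0.9319 |
| Lemur | | | | | | | | | | 0 | 0.1497 |
| Rabbit | | | | | | | | | | | 0 |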
It can be observed in Table 3 that the following pairs of species have markedly small similarity scores: human–chimpanzee, human–gorilla, gorilla–chimpanzee and bovine–gorilla. Gallus (the only nonmammalian species) is greatly dissimilar to the other species except for rat, because all other entries involving gallus are relatively large (see the fourth row of Table 3). It is also observed that opossum has larger similarity scores with the other species (see the tenth column of Table 3). As presented in Table 3, human–goat, human–bovine, goat–bovine, goat–gorilla, chimpanzee–bovine and chimpanzee–gorilla have smaller entries, so they are only moderately similar.
To compare our results with others, we list published results on the similarity between human and several other species in Table 4. As one can see, there is only limited variation among these different methods; the methods are therefore in overall agreement.
Table 4
The degree of similarity of the coding sequences of several species with the coding sequences of human.

| Source | Chimpanzee | Gorilla | Gallus | Opossum | Bovine | Goat |
|---|---|---|---|---|---|---|
| Our work, Table 3 | 0.0062 | 0.0424 | 1.1475 | 0.6468 | 0.0735 | 0.0789 |
| Liao and Ding (2006), Table 5 | 0.022893 | 0.025960 | 0.0106123 | 0.095765 | 0.048664 | 0.052039 |
| Liu et al. (2006), Table 5 | 0.0145 | 0.0079 | 0.2417 | 0.2815 | 0.0750 | 0.1078 |
| Yao et al. (2008), Table 10 | 0.00449 | 0.00478 | 0.02916 | 0.02999 | 0.01359 | 0.01633 |
| Zhang (2009), Table 1 | 0.9572 | 0.2633 | 1.1559 | 1.1863 | 0.3606 | 0.4769 |
| Tang et al. (2010), Table 3 | 0.0399 | 0.0441 | 0.1766 | 0.1598 | 0.0799 | 0.0869 |
| Tang et al. (2010), Table 4 | 0.0379 | 0.0423 | 0.1781 | 0.1598 | 0.0796 | 0.0855 |
Similarity analysis of cDNA sequences of beta-globin gene among different species
Based on the method proposed in Section 4.2.1, we compare similarities among the cDNA sequences of the beta-globin gene of the eight species listed in Table 5. The results are listed in Table 6.
Table 5
The cDNA sequences of beta-globin gene of 8 species.

| Species | Release date | UCSC version | Length (bp) |
|---|---|---|---|
| Human | Feb. 2009 | hg19/GRCh37 | 444 |
| Chimpanzee | Mar. 2006 | panTro2 | 444 |
| Rat | Nov. 2004 | rn4 | 444 |
| Mouse | July 2007 | mm9 | 444 |
| Tetraodon | Mar. 2007 | tetNig2 | 448 |
| Fugu | Oct. 2004 | fr2 | 444 |
| Mouse lemur | Jun. 2003 | micMur1 | 443 |
| Bushbaby | Dec. 2006 | otoGar1 | 444 |
Table 6
The similarity matrix of the cDNA of 8 species in Table 5: M(1/2).

| Species | Human | Chimpanzee | Rat | Mouse | Tetraodon | Fugu | Mouse lemur | Bushbaby |
|---|---|---|---|---|---|---|---|---|
| Human | 0 | 0.013149 | 0.36733 | 0.44743 | 0.60224 | 1.4339 | 0.37625 | 0.42344 |
| Chimpanzee | | 0 | 0.38038 | 0.46053 | 0.61479 | 1.4465 | 0.38932 | 0.43653 |
| Rat | | | 0 | 0.083386 | 0.27248 | 1.0744 | 0.009278 | 0.057991 |
| Mouse | | | | 0 | 0.19741 | 0.99561 | 0.074264 | 0.025603 |
| Tetraodon | | | | | 0 | 0.83818 | 0.26465 | 0.22101 |
| Fugu | | | | | | 0 | 1.066 | 1.0195 |
| Mouse lemur | | | | | | | 0 | 0.048804 |
| Bushbaby | | | | | | | | 0 |
It can be observed in Table 6 that the following pairs of species have markedly small similarity scores: human–chimpanzee, rat–mouse and mouse lemur–bushbaby. In fact, the eight species chosen here form four pairs of close evolutionary relatives, namely human–chimpanzee, rat–mouse, mouse lemur–bushbaby and tetraodon–fugu. However, although the similarity score of tetraodon–fugu is the smallest in the seventh column of Table 6, it is much larger than the scores of the other three closely related pairs. This problem remains to be studied further.
Conclusion
In this paper, we propose three graphical representations, namely the RY-curve, MK-curve and SW-curve, to represent a DNA sequence in 3D space. We prove that the 3D curves are strictly non-degenerate and that there is no loss of information in transferring the DNA sequence to the proposed curves. Compared with other graphical representations, the main advantage of our method is that the 2D projections of the RY-curve, MK-curve and SW-curve onto the x–y plane and the x–z plane have clear biological implication. For example, the 2D projection of the RY-curve onto the x–y plane denotes the changing trend of the content of A and G (see Fig. 4). The three components of the terminal node of these 3D curves are algebraically related to the content of the bases A, G, T and C (see Property 3.3, Property 3.4). Therefore, more information is retained by our method than by other available methods. As an application of the graphical representation, we derive a simple formula to recover the content of the four kinds of bases (A, G, C and T) in a DNA sequence from the proposed curves. The 12-component sequence descriptors we have constructed enable us to conduct similarity analysis among the coding sequences of the first exon of the beta-globin gene of 11 species. Our results are in overall agreement with the results reported in the literature (Zhang, 2009, Yao et al., 2008, Tang et al., 2010, Liao and Ding, 2006, Liu et al., 2006) (see Table 4). Our method also gives a good validation of the similarities among the cDNA sequences of closely related species. The computation involved in implementing the proposed methods is fairly straightforward.