Literature DB >> 22001081

Sequence comparison via polar coordinates representation and curve tree.

Abstract

Sequence comparison has become one of the essential bioinformatics tools in bioinformatics research, which could serve as evidence of structural and functional conservation, as well as of evolutionary relations among the sequences. Existing graphical representation methods have achieved promising results in sequence comparison, but there are some design challenges with the graphical representations and feature-based measures. We reported here a new method for sequence comparison. It considers whole distribution of dual bases and employs polar coordinates method to map a biological sequence into a closed curve. The curve tree was then constructed to numerically characterize the closed curve of biological sequences, and further compared biological sequences by evaluating the distance of the curve tree of the query sequence matching against a corresponding curve tree of the template sequence. The proposed method was tested by phylogenetic analysis, and its performance was further compared with alignment-based methods. The results demonstrate that using polar coordinates representation and curve tree to compare sequences is more efficient. Crown

Entities: Chemical Disease Gene Species

Mesh：

Substances：
beta-Globins

Year: 2011 PMID： 22001081 PMCID： PMC7094155 DOI： 10.1016/j.jtbi.2011.09.030

Source DB: PubMed Journal: J Theor Biol ISSN： 0022-5193 Impact factor: 2.691

Introduction

With the development of high-throughput sequencing technology, the rate of addition of new sequences to the databases increases continuously. However, such a collection of sequences does not by itself increase the scientist's understanding of the biology of organisms. Comparing a new sequence with the sequences of known functions is an effective way of assigning function to the new genes/proteins and understanding the biology of that organism from which the new sequence comes. Many methods have been proposed for sequence comparison. They can be categorized into two classes. One is alignment-based methods, in which dynamic programming that finds an optimal alignment by assigning scores to different possible alignments and picking the alignment with the highest score. Several alignment-based algorithms have been proposed such as global alignment, local alignment, with or without overlap (Gotoh, 1982, Needleman and Wunsch, 1970, Smith and Waterman, 1981). Waterman (1995) and Durbin et al. (1998) provided comprehensive reviews about this method. However, the search for optimal solutions using sequence alignment has problems in computationally load with large biological databases and choice of the scoring schemes (Pham and Zuegg, 2004, Vinga and Almeida, 2003). Therefore, the second class, alignment-free methods, was proposed to overcome the limitations of alignment-based methods. Graphical representation is one of widely used alignment-free methods. It provides a simple way of viewing, sorting and comparing various gene sequences with their intuitive pictures and pattern. Various graphical representations have been proposed during the past 10 years (Hamori and Ruskin, 1983, Gates, 1986, Nandy, 1994, Leong and Morgenthaler, 1995, Randic et al., 2003a, Randic et al., 2003b, Randic et al., 2006, Randic et al., 2001, Liao and Wang, 2004, Chi and Ding, 2005, Yao et al., 2005, Zhang and Liao, 2007, Zhang and Chen, 2006, Huang et al., 2009, Huang et al., 2011, Wu et al., 2010, Maaty et al., 2010a, Maaty et al., 2010b, Bai et al., 2011, Xie and Mo, 2011, Yao and Wang, 2004, Liu et al., 2006, Randic, 2000, Qi and Qi, 2007, Qi et al., 2007). Randic et al. (2011) gave a comprehensive review on this method. All the graphical representations generally differ in two aspects: graphical representations and feature-based similarity measures. Graphical representation of DNA sequences was first proposed by Hamori and Ruskin (1983) in which DNAs have been shown as 3D curves. Gates (1986), Nandy, (1994), and Leong and Morgenthaler (1995) developed 2D graphical representations of DNA sequences. These methods are straightforward but are accompanied with some loss of information due to overlapping and crossing of the curve representing DNA with itself. Randic et al. (2003a) developed a novel 2D representation method to overcome the degeneracy of the graphical representation. Recently, several other 2D and 3D representations have been proposed (Liao and Wang, 2004, Chi and Ding, 2005, Yao et al., 2005, Zhang and Liao, 2007, Zhang and Chen, 2006, Huang et al., 2009, Huang et al., 2011, Wu et al., 2010, Maaty et al., 2010a, Maaty et al., 2010b, Bai et al., 2011, Xie and Mo, 2011, Randic et al., 2003b, Randic et al., 2006, Randic et al., 2001, Yao and Wang, 2004, Liu et al., 2006, Randic, 2000, Qi and Qi, 2007, Qi et al., 2007). According to the handling bases of biological sequences, all the methods can be classified as: single nucleotide-based (Liao and Wang, 2004, Chi and Ding, 2005, Yao et al., 2005, Zhang and Liao, 2007, Zhang and Chen, 2006, Huang et al., 2009, Huang et al., 2011, Wu et al., 2010, Maaty et al., 2010a, Maaty et al., 2010b, Bai et al., 2011, Xie and Mo, 2011, Randic et al., 2003a, Randic et al., 2006, Yao and Wang, 2004) and dual nucleotide-based representations (Liu et al., 2006, Randic, 2000, Randic et al., 2001, Qi and Qi, 2007, Qi et al., 2007). They often assign the n bases to corresponding points (Liao and Wang, 2004, Chi and Ding, 2005, Yao et al., 2005, Zhang and Liao, 2007, Zhang and Chen, 2006, Huang et al., 2009, Huang et al., 2011, Wu et al., 2010, Maaty et al., 2010a, Maaty et al., 2010b, Bai et al., 2011, Xie and Mo, 2011), to the four lines (Randic et al., 2003b, Randic et al., 2006), or to the cell/system (Yao and Wang, 2004, Liu et al., 2006) to design the graphical representation. Some features of graphical representations have been proposed to capture the essence of the base composition/distribution of the sequences and further facilitate biological sequence comparison. These widely used features are always associated with the central coordinate and distance matrices. The central coordinate can effectively characterize the whole changes of the geometrical curves and has been widely used for biological sequence comparison (Liao and Ding, 2006, Wen and Zhang, 2009, Abo ElMaaty et al., 2010). Another useful tool for characterization of biological sequences is distance matrix that is proposed by Randic and Vracko (2000) and further developed by Randic et al. (2001), Song and Tang (2005), and Liao and Wang (2004). They first transformed the graphical representations of biological sequences into distance matrices such as E matrix, D/D matrix, L/L matrix and their ‘‘high order’’ matrices. Then they extracted the invariants of matrices such as leading eigenvalue, average row element, etc. to numerically characterize the biological sequences and designed the feature-based similarity measure for sequence comparison. Although the above graphical representation methods have achieved promising results, there are some problems in developing graphical representations and designing the feature-based similarity measures. First, many graphical representations were designed by assigning the single bases or dual nucleotides to corresponding direction/points/cells in Cartesian coordinates, so little attention has been paid to the whole distribution of the single nucleotide or dual nucleotides in biological sequences. Second, the choice of the direction/points/cells for the single base or dual nucleotides is arbitrary. Finally, the feature-based similarity measures are always associated with the invariants of the distance matrices that are gotten by complex repetitive computation. When the sequences are long, these kinds of feature-based similarity measures become less useful. Moreover, we believe that better representation and similarity measure will allow us to design more effective sequence comparison method. This paper introduced a novel method to represent and compare biological sequences. Based on the whole distribution of the dual bases, we proposed a polar coordinates representation that maps a biological sequence into a closed curve. The closed curve was then transformed into a curve tree instead of the distance matrix, and a tree matching distance was proposed to estimate the similarity of two biological sequences. To assess the effectiveness of the proposed method, we took two experiments and compared its performance with the alignment-based method.

Method

Polar coordinates representation of DNA sequences

Given a DNA Sequence, almost all the graphical representations map it into a curve in Cartesian coordinates. The polar coordinates have not been used for graphical representation of DNA sequence until now. In addition, dual nucleotides have been introduced to design graphical representations (Liu et al., 2006, Randic, 2000, Randic et al., 2001, Qi and Qi, 2007, Qi et al., 2007), in which each dual nucleotide is assigned to a corresponding point in Cartesian coordinates, but the distribution of the dual nucleotides is not considered in graphical representation. Here, we propose a novel graphical representation of DNA sequence in polar coordinates based on the distribution of the dual nucleotides. Given a DNA sequences, there are 16 kinds of the dual nucleotides. The distribution of the dual nucleotides consists of their frequencies in a given sequence. For a sequence s, the frequency of a dual nucleotide w , denoted by f(w ), is the number of occurrence of w in the sequence s, where , . The standard approach for calculating the frequencies of the dual nucleotide in a sequence of length m is to use a sliding window of length 2, shifting the frame one base at a time from position 1 to m−2+1, in which dual nucleotides are allowed to overlap in the sequence. In this way, the distribution of the dual nucleotides is represented by a 16-dimensional vector where N(w ) is the count of the dual nucleotides XY in DNA sequence s, , . For convenience, we denote the vector as . When the vector of DNA sequence is given, we are interested in its graphical representation in polar Coordinates. We first calculate the radius and angles of the distribution of the dual nucleotides as follows:where is a weighted value. Then we plot 16 feature points based the above radius and angles in the polar coordinates. Spline function is introduced to fit a smooth curve to a set of the radius and angles of the distribution of the dual nucleotides. Consider a cubic spline with abscissas x and ordinates y , i=0,2,…,N−1. If the second derivatives at each point are known, the spline function has the formwhere , , , , and is the second derivatives. Here, we choose x based on the distribution and the biological sequence Using the spline function, we obtain the function values y(i), i=1,2,3,…,n−1. Plot x(i) and y(i), we will get the closed curve of a DNA sequence in the polar coordinates. For example, Fig. 1 is the polar representation of the coding sequence of the first exon of β-globin gene of Human.

Fig. 1

Polar representation of coding sequence of the first exon of Human β-globin gene with w=1.

Polar representation of coding sequence of the first exon of Human β-globin gene with w=1. As for the parameter in the definition of radius and angles, we have performed extensive experiments. The polar coordinates representation with different show a clear trends: is suitable for the short sequence. As the sequence length increases, its closed curve with is becoming more and more like a circle, which is not suitable for comparing various sequences with their intuitive pictures and pattern. For example, the polar coordinates representation of HCoV-229E coronavirus genomes is shown in Fig. 2(a). At the same time, if the value of is too large, the curvature difference on the small-scale will be covered that is not good for sequence representation either. Therefore, we should increase to a suitable value if the sequence length is large. As for the sequence, its polar representation becomes more inerratic with represented in Fig. 2(b).

Fig. 2

Polar representation of HCoV-229E coronavirus genomes with =1 (a) and =2 (b).

Curve tree

In order to facilitate characterize and compare different polar representations, we map the closed curve into a curve tree that is constructed as following two steps: (1) dividing a closed curve into two open curves and determining the direction of them; (2) constructing the curve tree. Given a closed curve, we should divide it into two open curves and determine the direction of the curves. Take a closed curve represented in Fig. 3 as an example, we first find the two farthest points on the curve: A and B, with which the closed curve is divided into two open curves AM1B and BM2A, M1 and M2 are the intersections of the perpendicular bisector of line AB with the two open curves. Then, compare the curvilinear path of the four curves: M1A, M1B, M2A and M2B. Their comparison determines the initial points and directions of the two open curves. For example, if the curvilinear path of curve M1A is longer than other three curves, we define the curve AM1B as AB that is the first curve with initial point A, and the curve BM2A is the second curve denoted by CD whose initial point is C. When the lengths of the four curves are equal, we will define the initial point and the direction as the curvilinear path of curve M1A is the longest.

Fig. 3

Closed curve with two farthest points A and B, M1M2 is the perpendicular bisector of the line AB.

Closed curve with two farthest points A and B, M1M2 is the perpendicular bisector of the line AB. As for an open curve presented in Fig. 4, we will find its initial point A and M0, the midpoint of line AB. We draw the perpendicular bisector of line AB denoted by M0M that intersects the line AB at M. Then, we define h=(−1)aMM0/AM0 as the directed relative height of the line AB, where . If the vector , and the vector that is upright perpendicular to the plane AMM0, are satisfied with the Right-Hand Rule, a is equal to 1; otherwise, a is equal to 0.

Fig. 4

Curve with its initial point A, M0 is the midpoint of line AB.

Curve with its initial point A, M0 is the midpoint of line AB. The curve tree nodes store the directed relative heights of the line. Taking the curve of Fig. 5(a) for an example, the directed relative height of line AB, denoted as h 0,0, is stored in the root node of the curve tree. The curve AB is divided into two open curves by the point M0,0. For these two open curves, we repeat the procedure mentioned above, their results are shown in Fig. 5(b). We obtain the left child node and right child node of the root node h 0,0, which are denoted as h 1,0 and h 1,1, respectively. We then take the notes h 1,0 and h 1,1 as root nodes of the sub-trees, and repeat the operations until the obtained curve is almost a straight line. At last, we get a curve tree of the line AB, which is presented in Fig. 6.

Fig. 5

Construction of curve tree, (a) the curve AB is divided into two open curves by the point M0,0, (b) the curves AM0,0 and M0,0B is further divided into two open curves by the point M1,0 and M1,1.

Fig. 6

Curve tree of the curve presented in Fig. 5.

Construction of curve tree, (a) the curve AB is divided into two open curves by the point M0,0, (b) the curves AM0,0 and M0,0B is further divided into two open curves by the point M1,0 and M1,1. Curve tree of the curve presented in Fig. 5.

Curve match distance

Given a DNA sequence, we can map it into a closed curve and construct a curve tree to characterize the closed curve. Here, we are not only interested in using the curve tree to characterize the closed curve, but also interested in facilitating comparison of the polar representation of DNA sequences. As we all know, the more similar two sequences, the closer the two sequences relate. It is also true for the curve trees. That is to say, if the relations of two curve trees are closer, they are more similar. On the basis of this assumption, we define a curve match distance of the two curves C1 and C2 as follows:where () is the values of the j-th node on the i-th layer in the complete binary tree corresponding to the curve tree T1 (T2) of the curve C1 (C2), is a weight, n is the layers that is determined by the actual precision required and curvature curve, is monotonically increasing function that is defined as following: The function enlarges the distance between the larger local differences to improve comparison accuracy, and reduces the distance between the smaller local differences to increase the anti-jamming ability of the curve distance. is a weight series that influences the distance by the element differences of the curve tree. The value of should be set as the appropriate value on the basis of actual needs. If the differences of large-scale curvatures are as the same as that of small-scale curvatures of the curve, it is better to choose as a constant series. If we pay more attention on the curvature difference on the large-scale when comparing the different curves, it is better to choose as a descending series; otherwise, it is better to choose as an increasing series.

Results and discussion

Biological sequence comparison is the essential motivation of polar representation of DNA sequences. Here, we propose intuitive and quantitative methods to compare biological sequences with help of the proposed polar representation.

Sequence comparison with polar representation

The alphabet representation of biological sequences is easily handled with computer but difficult for us to observe their differences. Graphical representation provides us with a simple way to view various biological sequences and facilitate sequence comparison with the intuitive pictures and pattern. In Fig. 7, we present the polar representations of the first coding sequences of β-globin gene of Human, Gorilla, Gallus, and Rabbit. Comparing the closed curves, it is easily to find that the most similar pair is Human–Gorilla because they are Primates. The more similar pairs are Human–Rabbit and Gorilla–Rabbit, which is consist with the (Ferungulates, (Primates, Rodents)) grouping (Liao and Wang, 2004, Chi and Ding, 2005, Yao et al., 2005, Zhang and Liao, 2007, Zhang and Chen, 2006, Huang et al., 2009, Liao and Ding, 2006, Wen and Zhang, 2009). The closed curve of Gallus is dissimilar to the other because it is the only non-mammal animals among them. Therefore, polar representation provides us with a simple way to compare different biological sequences.

Fig. 7

Polar representations for the first coding sequences of β-globin gene of Human, Gorilla, Gallus, and Rabbit.

Polar representations for the first coding sequences of β-globin gene of Human, Gorilla, Gallus, and Rabbit. For comparison, we list the recently published results of the examination of the degree of similarity between Human and other several species in Fig. 8 (Zhang, 2009, Yu et al., 2009, Wang et al., 2010, Xie and Mo, 2011). As one can see there is an overall agreement among similarities obtained by different approaches despite some variation among them. But it is also noted that the degree of dissimilarities of Human–Goat and Human–Bovine are larger than that of Human–Opposum and Human–Gallus, which is an undesirable result because Gallus is the only non-mammal among them, and Opossum is the most remote species from the remaining mammals.

Fig. 8

Comparison degree of similarity/dissimilarity between Human and other several species in Table 1 with the proposed method and methods in Zhang (2009), Yu et al. (2009), Wang et al. (2010), and Xie and Mo (2011).

Table 1

Sequences of the first exon of β-globin gene of different species.

Species	Coding sequence
Human	ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAG GTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAG
Goat	ATGCTGACTGCTGAGGAGAAGGCTGCCGTCACCGGCTTCTGGGGCAAGGTGAA AGTGGATGAAGTTGGTGCTGAGGCCCTGGGCAG
Opossum	ATGGTGCACTTGACTTCTGAGGAGAAGAACTGCATCACTACCATCTGGTCTAAG GTGCAGGTTGACCAGACTGGTGGTGAGGCCCTTGGCAG
Gallus	ATGGTGCACTGGACTGCTGAGGAGAAGCAGCTCATCACCGGCCTCTGGGGCAA GGTCAATGTGGCCGAATGTGGGGCCGAAGCCCTGGCCAG
Lemur	ATGACTTTGCTGAGTGCTGAGGAGAATGCTCATGTCACCTCTCTGTGGGGCAAG GTGGATGTAGAGAAAGTTGGTGGCGAGGCCTTGGGCAG
Mouse	ATGGTGCACCTGACTGATGCTGAGAAGTCTGCTGTCTCTTGCCTGTGGGCAAA GGTGAACCCCGATGAAGTTGGTGGTGAGGCCCTGGGCAGG
Rabbit	ATGGTGCATCTGTCCAGTGAGGAGAAGTCTGCGGTCACTGCCCTGTGGGGCAAG GTGAATGTGGAAGAAGTTGGTGGTGAGGCCCTGGGC
Rat	ATGGTGCACCTAACTGATGCTGAGAAGGCTACTGTTAGTGGCCTGTGGGGAAAG GTGAACCCTGATAATGTTGGCGCTGAGGCCCTGGGCAG
Gorilla	ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAG GTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGG
Bovine	ATGCTGACTGCTGAGGAGAAGGCTGCCGTCACCGCCTTTTGGGGCAAGGTGAAA GTGGATGAAGTTGGTGGTGAGGCCCTGGGCAG
Chimpanzee	ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAG GTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGTTGGTATCAAGG

Phylogenetic analysis

Since the proposed curve matching distance is a distance measure, we can further evaluate the proposed method with phylogenetic analysis. Here, we choose two date sets that have been studied by many researchers (Liao and Wang, 2004, Chi and Ding, 2005, Yao et al., 2005, Zhang and Liao, 2007, Zhang and Chen, 2006, Huang et al., 2009, Liao and Ding, 2006, Wen and Zhang, 2009, Gu et al., 2004, Zhang, 2009). The first data set consists of the first exon of β-globin gene of 11 species presented in Table 1. It is a small data set with average sequence length 92. The second data set are 24 coronavirus genomes with average length about 30,000. They are downloaded from GenBank, of which 12 are SARS-CoVs and 12 are from other groups of coronaviruses. The name, accession number, abbreviation, and genome length for the 24 genomes are listed in Table 2.

Table 2

Accession number, abbreviation, name and length for each of the 24 coronavirus genomes.

No	Accession	Group	Abbreviation	Genome	Length (nt)
1	NC_002645	I	HCoV-229E	Human coronavirus 229E	27,317
2	NC_002306	I	TGEV	Transmissible gastroenteritis virus	28,586
3	NC_002436	I	PEDV	Porcine epidemic diarrhea virus	28,033
4	U00735	II	BCoVM	Bovine coronavirus strain Mebus	31,032
5	AF391542	II	BCoVL	Bovine coronavirus isolate BCoV-LUN	31,028
6	AF220295	II	BCoVQ	Bovine coronavirus strain Quebec	31,100
7	NC_003045	II	BCoV	Bovine coronavirus	31,028
8	AF208067	II	MHVM	Murine hepatitis virus strain ML-10	31,100
9	AF201929	II	MHV2	Murine hepatitis virus stain 2	31,028
10	AF208066	II	MHVP	Murine hepatitis virus strain Penn 97-1	31,233
11	NC_001846	II	MHV	Murine hepatitis virus	31,276
12	NC_001451	III	IBV	Avian infectious bronchitis virus	27,608
13	AY278488	IV	BJ01	SARS coronavirus BJ01	29,725
14	AY278741	IV	Urbani	SARS coronavirus Urbani	29,727
15	AY278491	IV	HKU-39849	SARS coronavirus HKU-39849	29,742
16	AY278554	IV	CUHK-W1	SARS coronavirus CUHK-W1	29,736
17	AY282752	IV	CUHK-Su10	SARS coronavirus CUHK-Su10	29,736
18	AY283794	IV	SIN2500	SARS coronavirus Sin2500	29,711
19	AY283795	IV	SIN2677	SARS coronavirus Sin2677	29,705
20	AY283796	IV	SIN2679	SARS coronavirus Sin2679	29,711
21	AY283797	IV	SIN2748	SARS coronavirus Sin2748	29,706
22	AY283798	IV	SIN2774	SARS coronavirus Sin2774	29,711
23	AY291451	IV	TW1	SARS coronavirus TW1	29,729
24	NC_004718	IV	TOR2	SARS coronavirus	29,751

Sequences of the first exon of β-globin gene of different species. Accession number, abbreviation, name and length for each of the 24 coronavirus genomes. Given a set of biological sequences, their phylogenetic relationship can be obtained through the following main operations: firstly, we construct the polar representation of biological sequences and calculate their curve matching distance based on the curve tree; secondly, by arranging all the curve matching distance into a matrix, we obtain a pair-wise distance matrix; finally, we put the pair-wise distance matrix into the neighbor-joining program in the PHYLIP package (Felsenstein, 1989). Fig. 9(a) is phylogenetic tree of the first exon of β-globin gene of 11 species using the proposed method with and . Fig. 10(a) is phylogenetic tree of the 24 coronavirus genomes obtained using the proposed method with and .

Fig. 9

Phylogenetic tree of 11 species based on (a) the proposed method, (b) the multiple alignment CLUSTAL X.

Fig. 10

Phylogenetic tree of 24 coronavirus genomes based on (a) the proposed method and (b) the multiple alignment CLUSTAL X.

Phylogenetic tree of 11 species based on (a) the proposed method, (b) the multiple alignment CLUSTAL X. Phylogenetic tree of 24 coronavirus genomes based on (a) the proposed method and (b) the multiple alignment CLUSTAL X. Generally, an independent method can be developed to evaluate the accuracy of a phylogenetic tree, or the validity of a phylogenetic tree can be tested by comparing it with authoritative ones. Here, we adopt the form one to test the validity of our phylogenetic tree. Both two data sets are aligned with the multiple alignmen CLUSTAL X and use the neighbor-joining to construct the phylogenetic tree presented in Figs. 9(b) and 10(b). From Fig. 9, we find that the eleven species are separated clearly in our results: (1) three Primates (Human, Gorilla and Chimpanzee) are clustered closely; (2) two Rodents (Mouse and Rat) are grouped closely; (3) Rabbit is clustered closely with Human, Gorilla and Chimpanzee. (4) Opossum and Gallus are less closely with other species, which is consistent with the fact that Gallus is the only non-mammal among them, and Opossum is the most remote species from the remaining mammals. Our results are consistent with the results of the multiple alignment CLUSTAL X (Fig. 9(b)). Fig. 10(a) shows that our results are quite consistent with the authoritative results (Gu et al., 2004, Zhang, 2009) and that of the multiple alignment in the following aspects. First of all, all SARS-CoVs are grouped in a separate branch, which appear different from the other three groups of coronaviruses. Secondly, BCOV, BCOVL, BCOVM, BCOVQ, MHV, MHV2, MHVM, and MHVP are grouped into a branch, which is consonant with that they belong to group II. Thirdly, HCoV-229E, TGEV, and PEDV are closely related to each other, which is consistent with the fact that they belong to group I. Finally, IBV forms a distinct branch within the genus Coronavirus, because it belongs to group III. Rota et al. (2003) found out that the overall level of similarity between SARS-CoVs and the other coronaviruses is low. Our tree also reconfirms that SARS-CoVs are not closely related to any previously isolated coronaviruses and form a new group, which indicates that the SARS-CoVs have undergone an independent evolution path after the divergence from the other coronaviruses.

Conclusion

Sequence comparison is one of the major goals of sequence analysis, which could serve as evidence of structural and functional conservation, as well as of evolutionary relations among the sequences. Despite the prevalence of the alignment-based methods, it is also noteworthy that it is computationally intensive and consequently unpractical for querying large data sets. Therefore, considerable efforts have been made to seek for alternative methods for sequence comparison. Graphical representation is one of widely used alignment-free methods to view, sort, and compare biological sequences. This work presented a novel method to represent and compare biological sequence. In contrast to the existing graphical representations, we used the whole distribution of the dual bases to map a biological sequence into a closed curve in polar coordinates. Then we transformed the closed curve into a curve tree instead of the distance matrix, and proposed a tree matching distance to estimate the similarity of two biological sequences. To compare the effectiveness of the proposed method, we performed extensive tests including similarity of biological sequences and phylogenetic analysis, and compared its performance with alignment-based method. The results demonstrate that the proposed method is efficient, which highlight the necessity for graphical representation method to consider the whole distribution of the dual bases. Thus, this understanding can then be used to guide development of more powerful graphical representation for biological sequence comparison.

23 in total

1. On the similarity of DNA primary sequences.

Authors: M Randić; M Vracko
Journal: J Chem Inf Comput Sci Date: 2000 May-Jun

2. On the characterization of DNA primary sequences by triplet of nucleic acid bases.

Authors: M Randić; X Guo; S C Basak
Journal: J Chem Inf Comput Sci Date: 2001 May-Jun

3. 2D-MH: A web-server for generating graphic representation of protein sequences based on the physicochemical properties of their constituent amino acids.

Authors: Zhi-Cheng Wu; Xuan Xiao; Kuo-Chen Chou
Journal: J Theor Biol Date: 2010-08-07 Impact factor: 2.691

4. Bilateral similarity function: a novel and universal method for similarity analysis of biological sequences.

Authors: Shiyuan Wang; Fengchun Tian; Yu Qiu; Xiao Liu
Journal: J Theor Biol Date: 2010-04-21 Impact factor: 2.691

5. Invariants of DNA sequences based on 2DD-curves.

Authors: Yusen Zhang; Wei Chen
Journal: J Theor Biol Date: 2006-05-02 Impact factor: 2.691

6. A simple way to look at DNA.

Authors: M A Gates
Journal: J Theor Biol Date: 1986-04-07 Impact factor: 2.691

7. A general method applicable to the search for similarities in the amino acid sequence of two proteins.

Authors: S B Needleman; C D Wunsch
Journal: J Mol Biol Date: 1970-03 Impact factor: 5.469

8. Three 3D graphical representations of DNA primary sequences based on the classifications of DNA bases and their applications.

Authors: Guosen Xie; Zhongxi Mo
Journal: J Theor Biol Date: 2010-10-20 Impact factor: 2.691