
Sequence comparison via polar coordinates representation and curve tree.

Qi Dai, Xiaodong Guo, Lihua Li.

Abstract

Sequence comparison has become an essential tool in bioinformatics research, because it can provide evidence of structural and functional conservation as well as of evolutionary relationships among sequences. Existing graphical representation methods have achieved promising results in sequence comparison, but design challenges remain in both the graphical representations and the feature-based measures. We report here a new method for sequence comparison. It considers the whole distribution of dual bases and uses polar coordinates to map a biological sequence into a closed curve. A curve tree is then constructed to characterize the closed curve numerically, and biological sequences are compared by evaluating the distance between the curve tree of the query sequence and the corresponding curve tree of the template sequence. The proposed method was tested by phylogenetic analysis, and its performance was compared with alignment-based methods. The results demonstrate that using the polar coordinates representation and curve tree to compare sequences is more efficient.
Crown Copyright © 2011. Published by Elsevier Ltd. All rights reserved.


Year:  2011        PMID: 22001081      PMCID: PMC7094155          DOI: 10.1016/j.jtbi.2011.09.030

Source DB:  PubMed          Journal:  J Theor Biol        ISSN: 0022-5193            Impact factor:   2.691


Introduction

With the development of high-throughput sequencing technology, new sequences are added to the databases at a continuously increasing rate. However, such a collection of sequences does not by itself increase scientists' understanding of the biology of organisms. Comparing a new sequence with sequences of known function is an effective way of assigning function to new genes/proteins and of understanding the biology of the organism from which the new sequence comes. Many methods have been proposed for sequence comparison, and they fall into two classes. The first class comprises alignment-based methods, which use dynamic programming to find an optimal alignment by assigning scores to the possible alignments and picking the one with the highest score. Several alignment-based algorithms have been proposed, such as global alignment and local alignment, with or without overlap (Gotoh, 1982, Needleman and Wunsch, 1970, Smith and Waterman, 1981). Waterman (1995) and Durbin et al. (1998) provided comprehensive reviews of these methods. However, the search for optimal solutions using sequence alignment suffers from a heavy computational load on large biological databases and from the need to choose a scoring scheme (Pham and Zuegg, 2004, Vinga and Almeida, 2003). The second class, alignment-free methods, was therefore proposed to overcome the limitations of alignment-based methods. Graphical representation is one of the widely used alignment-free methods. It provides a simple way of viewing, sorting and comparing gene sequences through intuitive pictures and patterns.
Various graphical representations have been proposed during the past 10 years (Hamori and Ruskin, 1983, Gates, 1986, Nandy, 1994, Leong and Morgenthaler, 1995, Randic et al., 2003a, Randic et al., 2003b, Randic et al., 2006, Randic et al., 2001, Liao and Wang, 2004, Chi and Ding, 2005, Yao et al., 2005, Zhang and Liao, 2007, Zhang and Chen, 2006, Huang et al., 2009, Huang et al., 2011, Wu et al., 2010, Maaty et al., 2010a, Maaty et al., 2010b, Bai et al., 2011, Xie and Mo, 2011, Yao and Wang, 2004, Liu et al., 2006, Randic, 2000, Qi and Qi, 2007, Qi et al., 2007). Randic et al. (2011) gave a comprehensive review of these methods. The methods generally differ in two aspects: the graphical representation itself and the feature-based similarity measure. Graphical representation of DNA sequences was first proposed by Hamori and Ruskin (1983), who displayed DNA sequences as 3D curves. Gates (1986), Nandy (1994), and Leong and Morgenthaler (1995) developed 2D graphical representations of DNA sequences. These methods are straightforward but are accompanied by some loss of information, because the curve representing the DNA overlaps and crosses itself. Randic et al. (2003a) developed a novel 2D representation method to overcome this degeneracy. Recently, several other 2D and 3D representations have been proposed (Liao and Wang, 2004, Chi and Ding, 2005, Yao et al., 2005, Zhang and Liao, 2007, Zhang and Chen, 2006, Huang et al., 2009, Huang et al., 2011, Wu et al., 2010, Maaty et al., 2010a, Maaty et al., 2010b, Bai et al., 2011, Xie and Mo, 2011, Randic et al., 2003b, Randic et al., 2006, Randic et al., 2001, Yao and Wang, 2004, Liu et al., 2006, Randic, 2000, Qi and Qi, 2007, Qi et al., 2007).
According to the bases they handle, the methods can be classified as single nucleotide-based (Liao and Wang, 2004, Chi and Ding, 2005, Yao et al., 2005, Zhang and Liao, 2007, Zhang and Chen, 2006, Huang et al., 2009, Huang et al., 2011, Wu et al., 2010, Maaty et al., 2010a, Maaty et al., 2010b, Bai et al., 2011, Xie and Mo, 2011, Randic et al., 2003a, Randic et al., 2006, Yao and Wang, 2004) and dual nucleotide-based representations (Liu et al., 2006, Randic, 2000, Randic et al., 2001, Qi and Qi, 2007, Qi et al., 2007). To design the graphical representation, they often assign the n bases to corresponding points (Liao and Wang, 2004, Chi and Ding, 2005, Yao et al., 2005, Zhang and Liao, 2007, Zhang and Chen, 2006, Huang et al., 2009, Huang et al., 2011, Wu et al., 2010, Maaty et al., 2010a, Maaty et al., 2010b, Bai et al., 2011, Xie and Mo, 2011), to four lines (Randic et al., 2003b, Randic et al., 2006), or to a cell/system (Yao and Wang, 2004, Liu et al., 2006). Several features of graphical representations have been proposed to capture the essence of the base composition/distribution of the sequences and to facilitate biological sequence comparison. The widely used features are associated with the central coordinate and with distance matrices. The central coordinate can effectively characterize the overall changes of the geometrical curves and has been widely used for biological sequence comparison (Liao and Ding, 2006, Wen and Zhang, 2009, Abo ElMaaty et al., 2010). Another useful tool for characterizing biological sequences is the distance matrix, which was proposed by Randic and Vracko (2000) and further developed by Randic et al. (2001), Song and Tang (2005), and Liao and Wang (2004). They first transformed the graphical representations of biological sequences into distance matrices such as the E matrix, D/D matrix, L/L matrix and their "high order" matrices.
Then they extracted invariants of the matrices, such as the leading eigenvalue and the average row element, to characterize the biological sequences numerically, and designed feature-based similarity measures for sequence comparison. Although the above graphical representation methods have achieved promising results, there are some problems in developing graphical representations and designing feature-based similarity measures. First, many graphical representations were designed by assigning single bases or dual nucleotides to corresponding directions/points/cells in Cartesian coordinates, so little attention has been paid to the whole distribution of the single or dual nucleotides in biological sequences. Second, the choice of the directions/points/cells for the single bases or dual nucleotides is arbitrary. Finally, the feature-based similarity measures are always associated with invariants of the distance matrices, which are obtained by complex repetitive computation. When the sequences are long, these kinds of feature-based similarity measures become less useful. Moreover, we believe that a better representation and similarity measure will allow us to design a more effective sequence comparison method. This paper introduces a novel method to represent and compare biological sequences. Based on the whole distribution of the dual bases, we propose a polar coordinates representation that maps a biological sequence into a closed curve. The closed curve is then transformed into a curve tree instead of a distance matrix, and a tree matching distance is proposed to estimate the similarity of two biological sequences. To assess the effectiveness of the proposed method, we carried out two experiments and compared its performance with the alignment-based method.

Method

Polar coordinates representation of DNA sequences

Given a DNA sequence, almost all graphical representations map it into a curve in Cartesian coordinates; polar coordinates have not been used for the graphical representation of DNA sequences until now. In addition, dual nucleotides have been introduced to design graphical representations (Liu et al., 2006, Randic, 2000, Randic et al., 2001, Qi and Qi, 2007, Qi et al., 2007), in which each dual nucleotide is assigned to a corresponding point in Cartesian coordinates, but the distribution of the dual nucleotides is not considered. Here, we propose a novel graphical representation of DNA sequences in polar coordinates based on the distribution of the dual nucleotides. A DNA sequence contains 16 kinds of dual nucleotides, and their distribution consists of their frequencies in the sequence. For a sequence $s$, the frequency of a dual nucleotide $w_i$, denoted by $f(w_i)$, is the number of occurrences of $w_i$ in $s$, where $w_i = XY$, $X, Y \in \{A, C, G, T\}$, $i = 1, 2, \ldots, 16$. The standard approach for calculating the frequencies of the dual nucleotides in a sequence of length $m$ is to use a sliding window of length 2, shifting the frame one base at a time from position 1 to $m-2+1$, so that dual nucleotides are allowed to overlap in the sequence. In this way, the distribution of the dual nucleotides is represented by a 16-dimensional vector $(N(w_1), N(w_2), \ldots, N(w_{16}))$, where $N(w_i)$ is the count of the dual nucleotide $w_i = XY$ in the DNA sequence $s$. For convenience, we denote the vector as […]. Given the vector of a DNA sequence, we are interested in its graphical representation in polar coordinates. We first calculate the radius and angles of the distribution of the dual nucleotides as follows: […], where $w$ is a weighting parameter. Then we plot 16 feature points in polar coordinates based on the above radii and angles. A spline function is used to fit a smooth curve through the radii and angles of the distribution of the dual nucleotides.
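The sliding-window count of dual nucleotides described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `dinucleotide_counts`, the alphabetical ordering of the 16 dual nucleotides, and the choice to skip windows containing ambiguous symbols are all assumptions made for the example.

```python
from itertools import product

BASES = "ACGT"
# The 16 dual nucleotides in alphabetical order (the ordering is an assumption).
DINUCLEOTIDES = ["".join(p) for p in product(BASES, repeat=2)]


def dinucleotide_counts(seq):
    """Count overlapping dual nucleotides with a sliding window of length 2."""
    seq = seq.upper()
    counts = dict.fromkeys(DINUCLEOTIDES, 0)
    for i in range(len(seq) - 1):      # m - 1 windows for a sequence of length m
        pair = seq[i:i + 2]
        if pair in counts:             # assumption: skip windows with symbols such as N
            counts[pair] += 1
    return [counts[d] for d in DINUCLEOTIDES]


# First ten bases of the human sequence in Table 1.
vector = dinucleotide_counts("ATGGTGCACC")
```

For a sequence of length m without ambiguous symbols, the entries of the vector sum to m − 1, which matches the window positions 1 to m − 2 + 1 given in the text.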
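Consider a cubic spline with abscissas $x_i$ and ordinates $y_i$, $i = 0, 1, \ldots, N-1$. If the second derivatives $y''_i$ at each point are known, the spline function on the interval $[x_i, x_{i+1}]$ has the form
$$y = A y_i + B y_{i+1} + C y''_i + D y''_{i+1},$$
where $A = (x_{i+1} - x)/(x_{i+1} - x_i)$, $B = 1 - A = (x - x_i)/(x_{i+1} - x_i)$, $C = \frac{1}{6}(A^3 - A)(x_{i+1} - x_i)^2$, $D = \frac{1}{6}(B^3 - B)(x_{i+1} - x_i)^2$, and $y''_i$ is the second derivative at $x_i$. Here, we choose $x_i$ based on the distribution and the biological sequence. Using the spline function, we obtain the function values $y(i)$, $i = 1, 2, 3, \ldots, n-1$. Plotting $x(i)$ and $y(i)$ gives the closed curve of a DNA sequence in polar coordinates. For example, Fig. 1 is the polar representation of the coding sequence of the first exon of the human β-globin gene.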
Fig. 1

Polar representation of coding sequence of the first exon of Human β-globin gene with w=1.
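The cubic spline step can be illustrated with the standard textbook evaluation formula, in which the interpolant on each interval is built from the knot values and their second derivatives. The sketch below is an assumption-laden example rather than the authors' implementation: it uses natural boundary conditions (zero second derivative at both ends), whereas a closed curve would more naturally call for periodic conditions, and the angle and radius values are made up because the radius and angle definitions are not available here.

```python
from bisect import bisect_right


def second_derivatives(x, y):
    """Second derivatives of a natural cubic spline through (x[i], y[i])."""
    n = len(x)
    y2 = [0.0] * n
    u = [0.0] * n
    for i in range(1, n - 1):          # forward sweep of the tridiagonal system
        sig = (x[i] - x[i - 1]) / (x[i + 1] - x[i - 1])
        p = sig * y2[i - 1] + 2.0
        y2[i] = (sig - 1.0) / p
        d = (y[i + 1] - y[i]) / (x[i + 1] - x[i]) - (y[i] - y[i - 1]) / (x[i] - x[i - 1])
        u[i] = (6.0 * d / (x[i + 1] - x[i - 1]) - sig * u[i - 1]) / p
    for k in range(n - 2, -1, -1):     # back substitution
        y2[k] = y2[k] * y2[k + 1] + u[k]
    return y2


def spline_eval(x, y, y2, t):
    """Evaluate y = A*y_i + B*y_{i+1} + C*y''_i + D*y''_{i+1} at t."""
    i = min(max(bisect_right(x, t) - 1, 0), len(x) - 2)
    h = x[i + 1] - x[i]
    a = (x[i + 1] - t) / h
    b = (t - x[i]) / h
    c = (a ** 3 - a) * h * h / 6.0
    d = (b ** 3 - b) * h * h / 6.0
    return a * y[i] + b * y[i + 1] + c * y2[i] + d * y2[i + 1]


# Hypothetical feature points: angles (abscissas) and radii (ordinates).
angles = [0.0, 1.0, 2.0, 3.0]
radii = [1.0, 2.0, 0.5, 1.5]
curvature = second_derivatives(angles, radii)
radius_at_midpoint = spline_eval(angles, radii, curvature, 1.5)
```

Evaluating the spline on a fine grid of angles and plotting the resulting (angle, radius) pairs gives a smooth curve through the feature points.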

As for the parameter $w$ in the definition of the radius and angles, we have performed extensive experiments. The polar coordinates representations with different values of $w$ show a clear trend: $w=1$ is suitable for short sequences. As the sequence length increases, the closed curve with $w=1$ becomes more and more like a circle, which is not suitable for comparing sequences through intuitive pictures and patterns. For example, the polar coordinates representation of the HCoV-229E coronavirus genome with $w=1$ is shown in Fig. 2(a). At the same time, if the value of $w$ is too large, small-scale curvature differences are masked, which is not good for sequence representation either. Therefore, $w$ should be increased to a suitable value when the sequence is long. For this sequence, the polar representation becomes more inerratic with $w=2$, as shown in Fig. 2(b).
Fig. 2

Polar representation of the HCoV-229E coronavirus genome with w=1 (a) and w=2 (b).


Curve tree

To make it easier to characterize and compare different polar representations, we map the closed curve into a curve tree, which is constructed in two steps: (1) dividing the closed curve into two open curves and determining their directions; (2) constructing the curve tree. Taking the closed curve in Fig. 3 as an example, we first find the two farthest points on the curve, A and B, which divide the closed curve into two open curves AM1B and BM2A, where M1 and M2 are the intersections of the perpendicular bisector of line AB with the two open curves. We then compare the curvilinear path lengths of the four curves M1A, M1B, M2A and M2B. This comparison determines the initial points and directions of the two open curves. For example, if the curvilinear path of curve M1A is longer than those of the other three curves, we define the curve AM1B as AB, the first curve with initial point A, and the curve BM2A as the second curve, denoted by CD, whose initial point is C. When the four curves have equal length, we define the initial point and the direction as if the curvilinear path of curve M1A were the longest.
Fig. 3

Closed curve with two farthest points A and B, M1M2 is the perpendicular bisector of the line AB.
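The first step, dividing a closed curve at its two farthest points, can be sketched for a curve sampled as a list of (x, y) vertices. The function name and the brute-force pair search are assumptions for illustration; the subsequent choice of initial point by comparing the four path lengths is not reproduced here.

```python
from math import dist


def split_at_farthest_points(points):
    """Split a closed polyline at its two mutually farthest vertices A and B."""
    n = len(points)
    # Brute-force search over all vertex pairs (quadratic in the number of samples).
    ia, ib = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda ij: dist(points[ij[0]], points[ij[1]]),
    )
    first = points[ia:ib + 1]                  # open curve from A to B
    second = points[ib:] + points[:ia + 1]     # open curve from B back to A
    return first, second


closed = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0), (1.0, -1.0)]
upper, lower = split_at_farthest_points(closed)
```

In this small example the farthest pair is (0, 0) and (4, 0), so the two open curves share those endpoints and each contains one of the remaining vertices.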

For an open curve such as the one in Fig. 4, we find its initial point A and M0, the midpoint of line AB. We draw the perpendicular bisector of line AB, denoted by M0M, which intersects the curve AB at M. Then, we define $h = (-1)^a \, MM_0 / AM_0$ as the directed relative height of the line AB, where $a \in \{0, 1\}$. If the vector […] and the vector perpendicular to the plane AMM0 satisfy the right-hand rule, $a$ is equal to 1; otherwise, $a$ is equal to 0.
Fig. 4

Curve with its initial point A, M0 is the midpoint of line AB.

The nodes of the curve tree store the directed relative heights. Taking the curve in Fig. 5(a) as an example, the directed relative height of line AB, denoted by $h_{0,0}$, is stored in the root node of the curve tree. The curve AB is divided into two open curves by the point $M_{0,0}$. For these two open curves we repeat the procedure described above; the results are shown in Fig. 5(b). We obtain the left and right child nodes of the root node $h_{0,0}$, which are denoted by $h_{1,0}$ and $h_{1,1}$, respectively. We then take the nodes $h_{1,0}$ and $h_{1,1}$ as root nodes of sub-trees and repeat the operations until each resulting curve is almost a straight line. Finally, we obtain the curve tree of the curve AB, which is presented in Fig. 6.
Fig. 5

Construction of the curve tree: (a) the curve AB is divided into two open curves by the point M0,0; (b) the curves AM0,0 and M0,0B are each further divided into two open curves by the points M1,0 and M1,1.

Fig. 6

Curve tree of the curve presented in Fig. 5.
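The recursive construction can be sketched for an open curve sampled as a polyline. Several details below are assumptions rather than the authors' definitions: the sign of the height is taken as positive when M lies to the left of the direction A to B (the source's right-hand rule vectors are not available), the recursion stops at a fixed number of layers instead of testing for near-straightness, and the function names are hypothetical.

```python
from math import hypot


def relative_height(curve):
    """Return (h, left, right) for an open polyline from A = curve[0] to B = curve[-1]."""
    (ax, ay), (bx, by) = curve[0], curve[-1]
    dx, dy = bx - ax, by - ay
    mx, my = (ax + bx) / 2.0, (ay + by) / 2.0   # M0, midpoint of chord AB
    half = hypot(dx, dy) / 2.0                  # length of AM0
    if half == 0.0 or len(curve) < 3:           # a straight segment has zero height
        return 0.0, curve, curve
    # Signed position of each vertex along the chord, measured from M0.
    s = [(px - mx) * dx + (py - my) * dy for px, py in curve]
    for k in range(len(curve) - 1):
        if s[k] <= 0.0 < s[k + 1]:              # first crossing of the perpendicular bisector
            t = -s[k] / (s[k + 1] - s[k])
            (px, py), (qx, qy) = curve[k], curve[k + 1]
            m = (px + t * (qx - px), py + t * (qy - py))
            break
    cross = dx * (m[1] - ay) - dy * (m[0] - ax)
    sign = 1.0 if cross >= 0.0 else -1.0        # assumed sign convention
    h = sign * hypot(m[0] - mx, m[1] - my) / half
    return h, curve[:k + 1] + [m], [m] + curve[k + 1:]


def curve_tree(curve, depth):
    """Layers of directed relative heights; layer i holds 2**i values."""
    layers, current = [], [curve]
    for _ in range(depth):
        heights, nxt = [], []
        for c in current:
            h, left, right = relative_height(c)
            heights.append(h)
            nxt.extend([left, right])
        layers.append(heights)
        current = nxt
    return layers


tree = curve_tree([(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)], depth=2)
```

Storing the tree layer by layer as a complete binary tree keeps node (i, j) of one curve aligned with node (i, j) of another, which is what a node-wise comparison of two trees needs.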


Curve match distance

Given a DNA sequence, we can map it into a closed curve and construct a curve tree to characterize that curve. We are interested not only in using the curve tree to characterize the closed curve, but also in using it to compare the polar representations of DNA sequences. The more similar two sequences are, the more closely they are related, and the same holds for curve trees: the closer two curve trees are, the more similar they are. On the basis of this assumption, we define a curve match distance of two curves C1 and C2 as follows: […] Here the node values are those of the j-th node on the i-th layer of the complete binary tree corresponding to the curve tree T1 (T2) of the curve C1 (C2), the distance includes a weight, n is the number of layers, which is determined by the required precision and the curvature of the curve, and the distance uses a monotonically increasing function defined as follows: […] This function enlarges the distance contributed by larger local differences to improve comparison accuracy, and reduces the distance contributed by smaller local differences to increase the anti-jamming ability of the curve distance. The weights form a series that controls how the element differences of the curve tree influence the distance, and their values should be set according to actual needs. If differences in large-scale curvature are as important as differences in small-scale curvature, it is better to choose a constant series. If more attention is paid to large-scale curvature differences when comparing curves, it is better to choose a descending series; otherwise, it is better to choose an increasing series.
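To make the role of the layer weights and the monotonically increasing function concrete, the sketch below assumes one plausible form of the distance: a weighted sum over layers of a function applied to the absolute node differences. Both the overall form and the default function (squaring, which shrinks differences below 1 and enlarges those above 1) are stand-ins chosen for illustration, since the exact definitions are not available here; the names are hypothetical.

```python
def curve_match_distance(tree1, tree2, weights, f=lambda d: d * d):
    """Assumed form: sum over layers i of weights[i] * sum_j f(|h1[i][j] - h2[i][j]|)."""
    total = 0.0
    for layer1, layer2, w in zip(tree1, tree2, weights):
        total += w * sum(f(abs(a - b)) for a, b in zip(layer1, layer2))
    return total


# Two hypothetical curve trees with two layers each (1 and 2 nodes).
t1 = [[1.0], [0.0, 0.0]]
t2 = [[0.5], [0.0, 0.5]]
# A descending weight series emphasizes large-scale (top-layer) differences.
d12 = curve_match_distance(t1, t2, weights=[1.0, 0.5])
```

With a constant weight series every layer contributes equally, while an increasing series would emphasize the small-scale differences stored in the deeper layers.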

Results and discussion

Biological sequence comparison is the essential motivation for the polar representation of DNA sequences. Here, we propose intuitive and quantitative methods to compare biological sequences with the help of the proposed polar representation.

Sequence comparison with polar representation

The alphabetic representation of biological sequences is easy to handle by computer, but it makes their differences difficult for us to see. Graphical representation provides a simple way to view biological sequences and facilitates sequence comparison through intuitive pictures and patterns. In Fig. 7, we present the polar representations of the first coding sequences of the β-globin gene of Human, Gorilla, Gallus, and Rabbit. Comparing the closed curves, it is easy to see that the most similar pair is Human–Gorilla, because both are Primates. The next most similar pairs are Human–Rabbit and Gorilla–Rabbit, which is consistent with the (Ferungulates, (Primates, Rodents)) grouping (Liao and Wang, 2004, Chi and Ding, 2005, Yao et al., 2005, Zhang and Liao, 2007, Zhang and Chen, 2006, Huang et al., 2009, Liao and Ding, 2006, Wen and Zhang, 2009). The closed curve of Gallus is dissimilar to the others because Gallus is the only non-mammal among them. Therefore, polar representation provides a simple way to compare different biological sequences.
Fig. 7

Polar representations for the first coding sequences of β-globin gene of Human, Gorilla, Gallus, and Rabbit.

For comparison, we list in Fig. 8 recently published results on the degree of similarity between Human and several other species (Zhang, 2009, Yu et al., 2009, Wang et al., 2010, Xie and Mo, 2011). As one can see, there is an overall agreement among the similarities obtained by the different approaches, despite some variation among them. It is also noted, however, that the degrees of dissimilarity of Human–Goat and Human–Bovine are larger than those of Human–Opossum and Human–Gallus, which is an undesirable result because Gallus is the only non-mammal among them, and Opossum is the most remote species from the remaining mammals.
Fig. 8

Comparison of the degree of similarity/dissimilarity between Human and several other species in Table 1, obtained with the proposed method and the methods of Zhang (2009), Yu et al. (2009), Wang et al. (2010), and Xie and Mo (2011).

Table 1

Sequences of the first exon of β-globin gene of different species.

Species      Coding sequence
Human        ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAG
Goat         ATGCTGACTGCTGAGGAGAAGGCTGCCGTCACCGGCTTCTGGGGCAAGGTGAAAGTGGATGAAGTTGGTGCTGAGGCCCTGGGCAG
Opossum      ATGGTGCACTTGACTTCTGAGGAGAAGAACTGCATCACTACCATCTGGTCTAAGGTGCAGGTTGACCAGACTGGTGGTGAGGCCCTTGGCAG
Gallus       ATGGTGCACTGGACTGCTGAGGAGAAGCAGCTCATCACCGGCCTCTGGGGCAAGGTCAATGTGGCCGAATGTGGGGCCGAAGCCCTGGCCAG
Lemur        ATGACTTTGCTGAGTGCTGAGGAGAATGCTCATGTCACCTCTCTGTGGGGCAAGGTGGATGTAGAGAAAGTTGGTGGCGAGGCCTTGGGCAG
Mouse        ATGGTGCACCTGACTGATGCTGAGAAGTCTGCTGTCTCTTGCCTGTGGGCAAAGGTGAACCCCGATGAAGTTGGTGGTGAGGCCCTGGGCAGG
Rabbit       ATGGTGCATCTGTCCAGTGAGGAGAAGTCTGCGGTCACTGCCCTGTGGGGCAAGGTGAATGTGGAAGAAGTTGGTGGTGAGGCCCTGGGC
Rat          ATGGTGCACCTAACTGATGCTGAGAAGGCTACTGTTAGTGGCCTGTGGGGAAAGGTGAACCCTGATAATGTTGGCGCTGAGGCCCTGGGCAG
Gorilla      ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGG
Bovine       ATGCTGACTGCTGAGGAGAAGGCTGCCGTCACCGCCTTTTGGGGCAAGGTGAAAGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAG
Chimpanzee   ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGTTGGTATCAAGG

Phylogenetic analysis

Since the proposed curve matching distance is a distance measure, we can further evaluate the proposed method with phylogenetic analysis. Here, we choose two data sets that have been studied by many researchers (Liao and Wang, 2004, Chi and Ding, 2005, Yao et al., 2005, Zhang and Liao, 2007, Zhang and Chen, 2006, Huang et al., 2009, Liao and Ding, 2006, Wen and Zhang, 2009, Gu et al., 2004, Zhang, 2009). The first data set consists of the first exon of the β-globin gene of the 11 species presented in Table 1. It is a small data set with an average sequence length of 92. The second data set consists of 24 coronavirus genomes with an average length of about 30,000. They were downloaded from GenBank; 12 are SARS-CoVs and 12 are from other groups of coronaviruses. The name, accession number, abbreviation, and genome length of the 24 genomes are listed in Table 2.
Table 2

Accession number, abbreviation, name and length for each of the 24 coronavirus genomes.

No   Accession   Group  Abbreviation  Genome                                     Length (nt)
1    NC_002645   I      HCoV-229E     Human coronavirus 229E                     27,317
2    NC_002306   I      TGEV          Transmissible gastroenteritis virus        28,586
3    NC_002436   I      PEDV          Porcine epidemic diarrhea virus            28,033
4    U00735      II     BCoVM         Bovine coronavirus strain Mebus            31,032
5    AF391542    II     BCoVL         Bovine coronavirus isolate BCoV-LUN        31,028
6    AF220295    II     BCoVQ         Bovine coronavirus strain Quebec           31,100
7    NC_003045   II     BCoV          Bovine coronavirus                         31,028
8    AF208067    II     MHVM          Murine hepatitis virus strain ML-10        31,100
9    AF201929    II     MHV2          Murine hepatitis virus strain 2            31,028
10   AF208066    II     MHVP          Murine hepatitis virus strain Penn 97-1    31,233
11   NC_001846   II     MHV           Murine hepatitis virus                     31,276
12   NC_001451   III    IBV           Avian infectious bronchitis virus          27,608
13   AY278488    IV     BJ01          SARS coronavirus BJ01                      29,725
14   AY278741    IV     Urbani        SARS coronavirus Urbani                    29,727
15   AY278491    IV     HKU-39849     SARS coronavirus HKU-39849                 29,742
16   AY278554    IV     CUHK-W1       SARS coronavirus CUHK-W1                   29,736
17   AY282752    IV     CUHK-Su10     SARS coronavirus CUHK-Su10                 29,736
18   AY283794    IV     SIN2500       SARS coronavirus Sin2500                   29,711
19   AY283795    IV     SIN2677       SARS coronavirus Sin2677                   29,705
20   AY283796    IV     SIN2679       SARS coronavirus Sin2679                   29,711
21   AY283797    IV     SIN2748       SARS coronavirus Sin2748                   29,706
22   AY283798    IV     SIN2774       SARS coronavirus Sin2774                   29,711
23   AY291451    IV     TW1           SARS coronavirus TW1                       29,729
24   NC_004718   IV     TOR2          SARS coronavirus                           29,751
Given a set of biological sequences, their phylogenetic relationship can be obtained through the following main operations: first, we construct the polar representations of the sequences and calculate their curve matching distances based on the curve trees; second, by arranging all the curve matching distances into a matrix, we obtain a pairwise distance matrix; finally, we feed the pairwise distance matrix into the neighbor-joining program in the PHYLIP package (Felsenstein, 1989). Fig. 9(a) is the phylogenetic tree of the first exon of the β-globin gene of 11 species obtained using the proposed method with […] and […]. Fig. 10(a) is the phylogenetic tree of the 24 coronavirus genomes obtained using the proposed method with […] and […].
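The second and third operations, arranging pairwise distances into a matrix and passing it to a neighbor-joining program, can be sketched as follows. The helper names are hypothetical, and the toy distance (absolute difference of made-up scalar features) merely stands in for the curve matching distance. The output follows the square distance-matrix layout read by PHYLIP's distance programs: the number of taxa on the first line, then one row per taxon with the name padded to ten characters.

```python
def pairwise_matrix(items, distance):
    """Symmetric matrix of distances between all pairs of items."""
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = distance(items[i], items[j])
    return matrix


def phylip_distance_matrix(names, matrix):
    """Format a square distance matrix in PHYLIP layout."""
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, matrix):
        lines.append(name[:10].ljust(10) + " ".join(f"{d:.6f}" for d in row))
    return "\n".join(lines) + "\n"


names = ["Human", "Gorilla", "Gallus"]
features = [0.0, 1.0, 3.0]             # made-up values, for illustration only
matrix = pairwise_matrix(features, lambda a, b: abs(a - b))
text = phylip_distance_matrix(names, matrix)
```

Writing `text` to a file gives an input that a neighbor-joining program accepting this layout can read; names longer than ten characters are truncated here, so abbreviations such as those in Table 2 are convenient.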
Fig. 9

Phylogenetic tree of 11 species based on (a) the proposed method, (b) the multiple alignment CLUSTAL X.

Fig. 10

Phylogenetic tree of 24 coronavirus genomes based on (a) the proposed method and (b) the multiple alignment CLUSTAL X.

Generally, an independent method can be developed to evaluate the accuracy of a phylogenetic tree, or the validity of a phylogenetic tree can be tested by comparing it with authoritative ones. Here, we adopt the former approach to test the validity of our phylogenetic trees. Both data sets were aligned with the multiple alignment program CLUSTAL X, and neighbor-joining was used to construct the phylogenetic trees presented in Figs. 9(b) and 10(b). From Fig. 9, we find that the eleven species are separated clearly in our results: (1) the three Primates (Human, Gorilla and Chimpanzee) are clustered closely; (2) the two Rodents (Mouse and Rat) are grouped closely; (3) Rabbit is clustered closely with Human, Gorilla and Chimpanzee; (4) Opossum and Gallus are less closely related to the other species, which is consistent with the fact that Gallus is the only non-mammal among them, and Opossum is the most remote species from the remaining mammals. Our results are consistent with those of the multiple alignment CLUSTAL X (Fig. 9(b)). Fig. 10(a) shows that our results are quite consistent with the authoritative results (Gu et al., 2004, Zhang, 2009) and with those of the multiple alignment in the following respects. First, all SARS-CoVs are grouped in a separate branch, distinct from the other three groups of coronaviruses. Second, BCoV, BCoVL, BCoVM, BCoVQ, MHV, MHV2, MHVM, and MHVP are grouped into one branch, which is consistent with the fact that they belong to group II. Third, HCoV-229E, TGEV, and PEDV are closely related to each other, which is consistent with the fact that they belong to group I. Finally, IBV forms a distinct branch within the genus Coronavirus, because it belongs to group III. Rota et al. (2003) found that the overall level of similarity between SARS-CoVs and the other coronaviruses is low. Our tree also confirms that SARS-CoVs are not closely related to any previously isolated coronaviruses and form a new group, which indicates that the SARS-CoVs have followed an independent evolutionary path since diverging from the other coronaviruses.

Conclusion

Sequence comparison is one of the major goals of sequence analysis, and it can provide evidence of structural and functional conservation as well as of evolutionary relationships among sequences. Despite the prevalence of alignment-based methods, they are computationally intensive and consequently impractical for querying large data sets. Considerable effort has therefore been made to find alternative methods for sequence comparison. Graphical representation is one of the widely used alignment-free methods to view, sort, and compare biological sequences. This work presented a novel method to represent and compare biological sequences. In contrast to the existing graphical representations, we used the whole distribution of the dual bases to map a biological sequence into a closed curve in polar coordinates. We then transformed the closed curve into a curve tree instead of a distance matrix, and proposed a tree matching distance to estimate the similarity of two biological sequences. To assess the effectiveness of the proposed method, we performed extensive tests, including similarity analysis of biological sequences and phylogenetic analysis, and compared its performance with the alignment-based method. The results demonstrate that the proposed method is efficient, which highlights the need for graphical representation methods to consider the whole distribution of the dual bases. This understanding can be used to guide the development of more powerful graphical representations for biological sequence comparison.
References (10 of 23 listed)

1.  On the similarity of DNA primary sequences.

Authors:  M Randić; M Vracko
Journal:  J Chem Inf Comput Sci       Date:  2000 May-Jun

2.  On the characterization of DNA primary sequences by triplet of nucleic acid bases.

Authors:  M Randić; X Guo; S C Basak
Journal:  J Chem Inf Comput Sci       Date:  2001 May-Jun

3.  2D-MH: A web-server for generating graphic representation of protein sequences based on the physicochemical properties of their constituent amino acids.

Authors:  Zhi-Cheng Wu; Xuan Xiao; Kuo-Chen Chou
Journal:  J Theor Biol       Date:  2010-08-07       Impact factor: 2.691

4.  Bilateral similarity function: a novel and universal method for similarity analysis of biological sequences.

Authors:  Shiyuan Wang; Fengchun Tian; Yu Qiu; Xiao Liu
Journal:  J Theor Biol       Date:  2010-04-21       Impact factor: 2.691

5.  Invariants of DNA sequences based on 2DD-curves.

Authors:  Yusen Zhang; Wei Chen
Journal:  J Theor Biol       Date:  2006-05-02       Impact factor: 2.691

6.  A simple way to look at DNA.

Authors:  M A Gates
Journal:  J Theor Biol       Date:  1986-04-07       Impact factor: 2.691

7.  A general method applicable to the search for similarities in the amino acid sequence of two proteins.

Authors:  S B Needleman; C D Wunsch
Journal:  J Mol Biol       Date:  1970-03       Impact factor: 5.469

8.  Three 3D graphical representations of DNA primary sequences based on the classifications of DNA bases and their applications.

Authors:  Guosen Xie; Zhongxi Mo
Journal:  J Theor Biol       Date:  2010-10-20       Impact factor: 2.691

9.  TN curve: a novel 3D graphical representation of DNA sequence based on trinucleotides and its applications.

Authors:  Jia-Feng Yu; Xiao Sun; Ji-Hua Wang
Journal:  J Theor Biol       Date:  2009-08-11       Impact factor: 2.691

10.  New 3D graphical representation of DNA sequence based on dual nucleotides.

Authors:  Xiao-Qin Qi; Jie Wen; Zhao-Hui Qi
Journal:  J Theor Biol       Date:  2007-09-01       Impact factor: 2.691
