A novel way to numerically characterize DNA sequences and its application.

Ying Guo1, Yan-Fang Wang1, Sheng-Li Zhang2.   

Abstract

We present a novel way to numerically characterize DNA sequences, based on their graphical representation, for sequence comparison and analysis. Instead of calculating the leading eigenvalues of the matrix for the graphical representation, we compute the curvature and torsion of curves as descriptors to numerically characterize DNA sequences. The new method was tested on three data sets: the coding sequences of the β-globin gene, all of their exons, and 24 coronavirus genomes from NCBI. The similarities/dissimilarities and the phylogenetic tree of these species verify the validity of our method. © 2010 Wiley Periodicals, Inc. Int J Quantum Chem, 2011.

Keywords:  curvature; graphical representation; phylogenetic tree; torsion

Year:  2010        PMID: 32327765      PMCID: PMC7168550          DOI: 10.1002/qua.22872

Source DB:  PubMed          Journal:  Int J Quantum Chem        ISSN: 0020-7608            Impact factor:   2.444


1. Introduction

Graphical techniques, initiated in 1983 by Hamori and Ruskin 1, have emerged as a powerful tool for the visualization and analysis of long DNA sequences. Several authors have proposed 2D graphical representations of DNA sequences based on two-dimensional Cartesian coordinates. The original plot of a DNA sequence as a random walk on a 2D grid, using the four cardinal directions to represent the four bases, was introduced by Gates 2, Leong and Morgenthaler 3, and Nandy 4. Their method assigns the four bases of a DNA sequence to the four directions of the (x, y) coordinate system. These 2D graphical representations provide useful insights into local and global characteristics of a sequence, such as the occurrence, variation, and repetition of nucleotides, that are not easily observed from the sequence directly. However, they lose some information because the curve representing the DNA overlaps and crosses itself. To eliminate, or at least reduce, this degeneracy, many higher-order and unique graphical representations have been proposed 5, 6, 7, 8, 9, 10, 11. In recent years, building on existing graphical representations, several authors have presented methods that assign mathematical descriptors to DNA sequences in order to compare sequences quantitatively and determine the similarities and dissimilarities between them 8, 9, 10, 12, 13, 14, 15, 16. In particular, the leading eigenvalues of the L/L matrices have been considered good descriptors of DNA sequences. However, computing the leading eigenvalues of the L/L matrices is expensive for long DNA sequences. Research into more efficient mathematical descriptors of DNA sequences is therefore necessary. Motivated by the search for an efficient descriptor, we propose a novel way to numerically characterize DNA sequences.
When a DNA sequence is mapped into 3D space, we obtain a curve. The curvature and torsion of this curve are then computed to numerically characterize the DNA sequence. The proposed numerical characterization is tested by similarity analysis and phylogenetic analysis on three different data sets. Our results show that the method is well suited to numerically characterizing DNA sequences. Furthermore, the method is fast because the whole process involves no complex algorithm.

2. Materials and Methods

2.1. 3D GRAPHICAL REPRESENTATION

Yuan et al. 7 proposed a 3D graphical representation in which A, G, T, and C are assigned to the −x, +x, −y, and +y directions, respectively, while the corresponding curve extends along the z-axis. In detail, for a given DNA sequence $G = g_1 g_2 \ldots g_N$, inspect it by stepping one base at a time. For step i (i = 1, 2, …, N), a 3D space point $P_i(x_i, y_i, z_i)$ can be constructed by the function $\phi(g_i)$ as follows: where N is the length of the given DNA sequence. When i runs from 1 to N, we have the points $P_1(x_1, y_1, z_1), P_2(x_2, y_2, z_2), \ldots, P_N(x_N, y_N, z_N)$. Connecting adjacent points, we obtain a 3D zigzag curve. For example, the 3D graphical representation of the sequence ATGGTGCACC is presented in Figure 1.
Figure 1

The 3‐D graphical representation of the sequence ATGGTGCACC.

According to this method of graphical representation, three curves correspond to the same DNA sequence. If we assign the four nucleotide bases as follows: we obtain the second 3D curve. For the same sequence, ATGGTGCACC, the second curve is shown in Figure 2.
Figure 2

The 3‐D graphical representation of the sequence ATGGTGCACC.

The third curve is obtained by assigning the four nucleotide bases as follows: With three curves corresponding to the same DNA sequence, we denote them for convenience as the curves of the patterns AGTC, ATCG, and ACGT.
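As a rough illustration of the construction in this section, the following Python sketch builds the points of the three curves. The coordinate rule is an assumption, because the defining equation of ϕ is not reproduced here: each base is taken to move the point one unit in its assigned x or y direction, cumulatively, while z equals the position i. The letters of each pattern name are likewise assumed to list the bases assigned to −x, +x, −y, and +y, in that order, as they do for AGTC. The name curve_points is hypothetical.

```python
# Sketch only: the step rule below is an assumption, not the source's formula.
# Assumed: the letters of a pattern map, in order, to the -x, +x, -y, +y
# directions; the walk is cumulative in x and y, and z is the position i.
DIRECTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]


def curve_points(seq, pattern="AGTC"):
    """Return the list of 3D points (x_i, y_i, z_i) for a DNA sequence."""
    step = dict(zip(pattern, DIRECTIONS))
    x = y = 0
    points = []
    for i, base in enumerate(seq.upper(), start=1):
        dx, dy = step[base]  # raises KeyError for symbols other than A, C, G, T
        x, y = x + dx, y + dy
        points.append((x, y, i))
    return points


seq = "ATGGTGCACC"
curves = {p: curve_points(seq, p) for p in ("AGTC", "ATCG", "ACGT")}
for name, pts in curves.items():
    print(name, pts[:3], "...", pts[-1])
```

Connecting consecutive points of each list gives one zigzag curve per pattern. If the source's ϕ places points differently (for example, without accumulating x and y), only the body of the loop would change.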

2.2. THE CURVATURE AND TORSION OF THE CURVE

The most fundamental characteristics of a curve are its curvature and torsion, so we take the curvature and torsion of the curves as descriptors to numerically characterize the curve of a DNA sequence. The zigzag curve from the graphical representation of Yuan et al. is not smooth. In this section, we introduce a new method to calculate the curvature and torsion of non-smooth curves. Following reference 17, let $\Delta$ be the difference operator, which assigns to every function f(x) the function $g = \Delta f$ defined by g(x) = f(x + 1) − f(x). For each integer n ≥ 2, we define $\Delta^n f = \Delta(\Delta^{n-1} f)$, and we write $\Delta^n f(x)$ instead of $(\Delta^n f)(x)$. Then, we have: So we can get the first to third difference forms as below: Then the three curves are obtained, denoted by For the i-th curve, its curvature k and torsion τ can be calculated by the following formula 18. If t is equal to $t_0$, the curvature and torsion values are Given a DNA sequence of length N, N curvature and torsion values are obtained. The averages of these N curvature and torsion values, denoted by k and τ respectively, can be computed as: Because curvature and torsion are characteristics of the curve, they can in turn be regarded as descriptors that numerically characterize it. To extract more features from the sequence, we construct a six-component vector, composed of three average curvatures and three average torsions, to numerically characterize the DNA sequence.
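A minimal Python sketch of the idea in this section, under stated assumptions. It takes the standard space-curve formulas, k = |r′ × r″| / |r′|^3 and τ = (r′ × r″) · r‴ / |r′ × r″|^2, and replaces the derivatives by the first, second, and third forward differences Δr, Δ²r, and Δ³r of the point sequence. Whether this matches the source's difference form term by term is an assumption. The function names, the choice to stop three points before the end (where Δ³ is undefined), and the convention τ = 0 where Δr × Δ²r vanishes are illustrative choices, not taken from the source.

```python
import math


def diff(vectors):
    """Forward difference, componentwise: (Δv)_i = v_{i+1} - v_i."""
    return [tuple(b - a for a, b in zip(v, w))
            for v, w in zip(vectors, vectors[1:])]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def curvature_torsion(points):
    """Discrete curvature and torsion at every point where Δ, Δ², Δ³ exist.

    Assumes consecutive points are distinct, so that |Δr| > 0.
    """
    d1 = diff(points)   # Δr_i  = r_{i+1} - r_i
    d2 = diff(d1)       # Δ²r_i = r_{i+2} - 2 r_{i+1} + r_i
    d3 = diff(d2)       # Δ³r_i = r_{i+3} - 3 r_{i+2} + 3 r_{i+1} - r_i
    ks, ts = [], []
    for r1, r2, r3 in zip(d1, d2, d3):
        c = cross(r1, r2)
        c2 = dot(c, c)
        ks.append(math.sqrt(c2) / math.sqrt(dot(r1, r1)) ** 3)
        # Assumed convention: torsion is set to 0 on locally straight runs.
        ts.append(dot(c, r3) / c2 if c2 > 0 else 0.0)
    return ks, ts


def mean(values):
    return sum(values) / len(values)


# Four hand-checkable points; a DNA curve would supply N points instead.
points = [(0, 0, 0), (1, 0, 1), (1, 1, 2), (0, 1, 3)]
ks, ts = curvature_torsion(points)
print(mean(ks), mean(ts))
```

For a DNA sequence, the six-component vector would collect mean(ks) and mean(ts) for each of the three patterns. Note that this sketch yields N − 3 values for N points; how the source obtains N values is not recoverable from the text.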

3. Results and Discussions

3.1. SIMILARITY ANALYSIS

Comparing different DNA sequences is the main application of our method. Table I lists the coding sequences of the β-globin genes of 11 species and their exons. Table II shows the six-component vectors of the coding sequences of the β-globin genes of the 11 species.
Table I

The accession numbers, lengths, and locations of the β-globin genes and their exons

Species        Database  ID      Location     Length (bp)  Location of each exon
1 Human        NCBI      U01317  62187–63610  1424         62187···62278, 62409···62631, 63482···63610
2 Chimpanzee   NCBI      X02345  4189–5532    1344         4189···4293, 4412···4633, 5484···5532
3 Gorilla      NCBI      X61109  4538–5881    1344         4538···4630, 4761···4982, 5833···5881
4 Lemur        NCBI      M15734  154–1595     1442         154···245, 376···598, 1467···1595
5 Rat          NCBI      X06701  310–1505     1196         310···401, 517···739, 1377···1505
6 Mouse        NCBI      V00722  275–1462     1188         275···367, 484···705, 1334···1462
7 Goat         NCBI      M15387  279–1749     1471         279···364, 493···715, 1621···1749
8 Bovine       NCBI      X00376  278–1741     1464         278···363, 492···714, 1613···1741
9 Rabbit       NCBI      V00882  277–1419     1143         277···368, 495···717, 1291···1419
10 Opossum     NCBI      J03643  467–2488     2022         467···558, 672···894, 2360···2488
11 Gallus      NCBI      V00409  465–1810     1346         465···556, 649···871, 1682···1810
Table II

The six-component vectors of the coding sequences of the β-globin gene of 11 species

Species     AGTC (k)  AGTC (τ)  ATCG (k)  ATCG (τ)  ACGT (k)  ACGT (τ)
Human       0.63200   0.05099   0.63072   0.03132   0.62382   0.02744
Chimpanzee  0.63112   0.05498   0.62941   0.03825   0.62328   0.02229
Gorilla     0.62998   0.05751   0.62843   0.04263   0.62133   0.02117
Lemur       0.61447   0.05474   0.61306   0.01929   0.61350   0.02304
Rat         0.64599   0.03798   0.63502   0.04446   0.63577   0.04387
Mouse       0.64188   0.05029   0.63145   0.02985   0.63260   0.04137
Goat        0.64591   0.07325   0.63768   0.02751   0.63712   0.02328
Bovine      0.64807   0.06577   0.64012   0.02722   0.63735   0.01937
Rabbit      0.63439   0.04962   0.62955   0.01564   0.63016   0.02001
Opossum     0.64708   0.04546   0.63761   0.00544   0.63592   0.03010
Gallus      0.64163   0.07442   0.62966   −0.00025  0.63436   0.05336
Given a vector representation of a DNA sequence, we can compare sequences using any existing distance measure for vectors. The distance between two DNA sequences can be computed as the Euclidean distance between the end points of the two vectors representing them. The Euclidean distance between u and v is defined as $d(u, v) = \sqrt{\sum_{i=1}^{6} (u_i - v_i)^2}$, where u and v are vectors, and $u_i$ and $v_i$ denote the i-th of the six components of u and v, respectively. The underlying rationale is that if two vectors point in similar directions and the difference in their magnitudes is small, then the two sequences they represent are similar. In other words, the smaller the Euclidean distance between the end points of two vectors, the more similar the two sequences represented by these vectors. Table III gives the similarity matrix of the coding sequences of the β-globin gene of the 11 species. Following the same method, we also obtain the similarity matrices for each exon of the 11 species, which are shown in Tables IV, V, and VI.
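The distance computation described above can be sketched in Python as follows. The vectors are copied from Table II for four of the 11 species; the names vectors, euclidean, and dist are illustrative and not from the source.

```python
import math

# Six-component vectors (k and tau for the patterns AGTC, ATCG, ACGT),
# copied from Table II for four species.
vectors = {
    "Human":      (0.63200, 0.05099, 0.63072, 0.03132, 0.62382, 0.02744),
    "Chimpanzee": (0.63112, 0.05498, 0.62941, 0.03825, 0.62328, 0.02229),
    "Gorilla":    (0.62998, 0.05751, 0.62843, 0.04263, 0.62133, 0.02117),
    "Gallus":     (0.64163, 0.07442, 0.62966, -0.00025, 0.63436, 0.05336),
}


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


names = list(vectors)
# Full symmetric distance matrix; the tables list only its upper triangle.
dist = [[euclidean(vectors[a], vectors[b]) for b in names] for a in names]

for name, row in zip(names, dist):
    print(f"{name:<11}" + " ".join(f"{d:.5f}" for d in row))
```

With these inputs the Human-Chimpanzee, Human-Gorilla, and Human-Gallus distances agree with the corresponding entries of Table III (0.00965, 0.01501, 0.04922) to within the rounding of the tabulated vectors.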
Table III

The similarity matrix of the coding sequences of the β‐globin gene of 11 species

Species     Human      Chimpanzee Gorilla    Lemur      Rat        Mouse      Goat       Bovine     Rabbit     Opossum    Gallus
Human       0          0.00965    0.01501    0.03007    0.03112    0.01929    0.03076    0.02881    0.01871    0.03360    0.04922
Chimpanzee             0          0.00574    0.03163    0.03467    0.02576    0.03048    0.02909    0.02455    0.04135    0.05531
Gorilla                           0          0.03308    0.03752    0.03002    0.03271    0.03209    0.02984    0.04688    0.05889
Lemur                                        0          0.05762    0.04384    0.05063    0.05127    0.03154    0.04996    0.05601
Rat                                                     0          0.02027    0.04431    0.04126    0.04161    0.04215    0.05888
Mouse                                                              0          0.03057    0.02943    0.02691    0.02867    0.04048
Goat                                                                          0          0.00907    0.03094    0.03617    0.04203
Bovine                                                                                   0          0.02731    0.03180    0.04631
Rabbit                                                                                              0          0.02196    0.04528
Opossum                                                                                                        0          0.03883
Gallus                                                                                                                    0
Table IV

The similarity matrix of the coding sequences of the first exon of the β‐globin gene of 11 species

Species     Human      Chimpanzee Gorilla    Lemur      Rat        Mouse      Goat       Bovine     Rabbit     Opossum    Gallus
Human       0          0.04546    0.02183    0.15386    0.08196    0.10383    0.08294    0.06119    0.10542    0.10540    0.14157
Chimpanzee             0          0.04459    0.14779    0.11172    0.12163    0.11736    0.06849    0.12709    0.08631    0.13549
Gorilla                           0          0.15326    0.07260    0.09546    0.08338    0.05888    0.11757    0.10732    0.15115
Lemur                                        0          0.17836    0.13171    0.17118    0.10132    0.11817    0.11156    0.08159
Rat                                                     0          0.06226    0.04167    0.08762    0.13894    0.13912    0.19241
Mouse                                                              0          0.06483    0.06798    0.12206    0.11249    0.16349
Goat                                                                          0          0.08303    0.11411    0.13033    0.17459
Bovine                                                                                   0          0.08848    0.07362    0.10987
Rabbit                                                                                              0          0.12287    0.09962
Opossum                                                                                                        0          0.10638
Gallus                                                                                                                    0
Table V

The similarity matrix of the coding sequences of the second exon of the β‐globin gene of 11 species

Species     Human      Chimpanzee Gorilla    Lemur      Rat        Mouse      Goat       Bovine     Rabbit     Opossum    Gallus
Human       0          0.00418    0.00992    0.05362    0.05805    0.05955    0.05134    0.04771    0.03977    0.05134    0.07491
Chimpanzee             0          0.00838    0.05061    0.05425    0.05626    0.05315    0.05023    0.03945    0.04832    0.07303
Gorilla                           0          0.05111    0.05425    0.05689    0.06028    0.05695    0.04530    0.04747    0.07298
Lemur                                        0          0.03670    0.02035    0.09014    0.08925    0.05604    0.01426    0.03782
Rat                                                     0          0.02628    0.08780    0.09124    0.06804    0.03381    0.06576
Mouse                                                              0          0.09220    0.09382    0.06639    0.01814    0.04051
Goat                                                                          0          0.01432    0.05231    0.09081    0.11111
Bovine                                                                                   0          0.04682    0.08976    0.10931
Rabbit                                                                                              0          0.05919    0.08091
Opossum                                                                                                        0          0.03770
Gallus                                                                                                                    0
Table VI

The similarity matrix of the coding sequences of the third exon of the β‐globin gene of 11 species

Species     Human      Chimpanzee Gorilla    Lemur      Rat        Mouse      Goat       Bovine     Rabbit     Opossum    Gallus
Human       0          0.04846    0.09099    0.12016    0.09176    0.12910    0.11005    0.08975    0.09191    0.04514    0.09097
Chimpanzee             0          0.05847    0.13822    0.13576    0.14872    0.10518    0.10569    0.09840    0.06711    0.11851
Gorilla                           0          0.13308    0.17006    0.15632    0.08933    0.11427    0.10655    0.08887    0.14676
Lemur                                        0          0.11037    0.06751    0.06350    0.05128    0.06878    0.07917    0.10755
Rat                                                     0          0.10577    0.13933    0.09200    0.10699    0.08413    0.07091
Mouse                                                              0          0.09428    0.06444    0.07437    0.09237    0.11379
Goat                                                                          0          0.05293    0.04704    0.07162    0.11004
Bovine                                                                                   0          0.02260    0.05066    0.07055
Rabbit                                                                                              0          0.05587    0.07534
Opossum                                                                                                        0          0.07596
Gallus                                                                                                                    0
In Table III, for the coding sequences of the β-globin gene of the 11 species, the coding sequence of Gallus is clearly the most dissimilar to those of the other 10 species, which is consistent with the fact that Gallus is a non-mammal, whereas the others are mammals. The more similar species pairs are Human-Gorilla, Human-Chimpanzee, Rat-Mouse, and Gorilla-Chimpanzee, which is consistent with the results obtained by Randić 5, 19 and B. Liao 20. In Tables IV, V, and VI, for the individual exons of the β-globin gene of the 11 species, there are some flaws: some entries are no better than those of Table III. To compare with other methods, we use the leading eigenvalues of the E, L/L, and M/M matrices 7 to perform the similarity analysis on the same data. The similarity for any pair of DNA sequences is obtained by calculating the Euclidean distance between their leading eigenvalues. The similarities between Human and the other species are listed in Table VII. Table VII shows that our results are better than those based on the E, L/L, and M/M matrices. For example, with our method Human is more similar to Chimpanzee and Gorilla than to Lemur, whereas with the E, L/L, and M/M matrices Human is more similar to Lemur, which does not accord with the results reported in references 5, 19, 20.
Table VII

Comparison of the similarity between Human and the other 10 species based on our method and Yuan's method

Species     Chimpanzee Gorilla    Lemur      Rat        Mouse      Goat       Bovine     Rabbit     Opossum    Gallus
β gene
E (10^6)    0.0770     0.0770     0.0179     0.2076     0.2142     0.0472     0.0401     0.2506     0.7159     0.0743
L/L         0.1143     0.1169     0.0140     0.3584     0.3761     0.0503     0.0409     0.4442     0.6943     0.1193
M/M (10^3)  0.0800     0.0800     0.0180     0.2279     0.2359     0.0471     0.0401     0.2810     0.5981     0.7914
My work     0.00965    0.01501    0.03007    0.03112    0.01929    0.03076    0.02881    0.01871    0.03360    0.04922
1st exon
E (10^3)    0.8900     0.0642     0.0000     0.0001     0.1293     0.3711     0.3712     0.1266     0.0002     0.0001
L/L         0.2647     0.0165     0.0672     0.0111     0.0107     0.1455     0.1275     0.0737     0.0497     0.0177
M/M         13.0146    1.0175     0.1072     0.0032     2.0398     5.9294     6.0303     1.9919     0.1728     0.0229
My work     0.04546    0.02183    0.15386    0.08196    0.10383    0.08294    0.06119    0.10542    0.10540    0.14157
2nd exon
E (10^4)    0.0155     0.0155     0.0000     0.0000     0.0155     0.0000     0.0000     0.0000     0.0000     0.0000
L/L         0.0087     0.0084     0.0045     0.0227     0.0165     0.0111     0.0029     0.0076     0.0025     0.0115
M/M         0.9990     0.9995     0.0394     0.0592     1.0380     0.0907     0.0524     0.0571     0.0263     0.1075
My work     0.00418    0.00992    0.05362    0.05805    0.05955    0.05134    0.04771    0.03977    0.05134    0.07491
3rd exon
E (10^3)    4.9488     4.9488     0.0000     0.0000     0.0001     0.0001     0.0001     0.0000     0.0001     0.0001
L/L         1.9337     1.9384     0.0150     0.0167     0.0116     0.0079     0.0204     0.0106     0.0113     0.0051
M/M         80.0582    80.0664    0.0425     0.0231     0.0701     0.0057     0.0256     0.0599     0.0985     0.1431
My work     0.04846    0.09099    0.12016    0.09176    0.12910    0.11005    0.08975    0.09191    0.04514    0.09097
On the other hand, it is noteworthy that computing the eigenvalues of the E, L/L, and M/M matrices is computationally intensive: its running time is about 6.5 times that of our method. For example, for the β-globin gene, computing the leading eigenvalues of the E, L/L, and M/M matrices takes 2.103 min, whereas our method takes just 19.4 s, on a 1.41 GHz AMD processor with 512 MB of total memory. Our method is clearly faster.

3.2. PHYLOGENETIC ANALYSIS

Phylogenetics is the study of the evolutionary history of organisms. It can also provide information for function prediction: when sequences are grouped into families, the grouping offers clues about the general features of each family and evidence of how the sequences evolved. Given a set of DNA sequences, their phylogenetic relationship can be obtained as follows. First, we calculate the numerical characterizations of the DNA sequences and the Euclidean distances between these characterizations. Second, we arrange all the distances into a distance matrix. Finally, we pass the distance matrix to the UPGMA program in the PHYLIP package and draw the resulting phylogenetic tree with the TreeView program. To further demonstrate the utility of our method, we also analyze 24 coronavirus genomes, which are listed in Table VIII. Recently, much attention has been paid to severe acute respiratory syndrome (SARS), which was first identified in Guangdong Province, China, and then spread rapidly to several countries. Research on the relationships between the SARS-CoVs and the other coronaviruses can help in discovering drugs and developing vaccines against the virus. The phylogenetic tree for the 24 coronavirus genomes constructed with our method is presented in Figure 3. To check its validity, we also constructed an evolutionary tree with Clustal X, a multiple sequence alignment program. The result is shown in Figure 4.
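The last two steps of the procedure described above (distance matrix, then UPGMA) can be sketched in Python. This is a minimal, self-contained UPGMA (average-linkage clustering) written for illustration; it is not the PHYLIP implementation the text refers to, and it returns only the tree topology as a nested string, without branch lengths. The four-species distance matrix is copied from Table III; the function name upgma is hypothetical.

```python
def upgma(names, dist):
    """Cluster by UPGMA and return the topology as a nested string."""
    n = len(names)
    clusters = {i: (names[i], 1) for i in range(n)}  # id -> (label, size)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        a, b = min(d, key=d.get)  # closest pair of clusters, with a < b
        label_a, size_a = clusters.pop(a)
        label_b, size_b = clusters.pop(b)
        merged = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            # Size-weighted average = mean distance over all member pairs.
            merged[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        # Drop every distance that involves the two merged clusters.
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        d.update(merged)
        clusters[next_id] = (f"({label_a},{label_b})", size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


# Distances between four species, copied from Table III.
names = ["Human", "Chimpanzee", "Gorilla", "Gallus"]
dist = [
    [0.0,     0.00965, 0.01501, 0.04922],
    [0.00965, 0.0,     0.00574, 0.05531],
    [0.01501, 0.00574, 0.0,     0.05889],
    [0.04922, 0.05531, 0.05889, 0.0],
]
tree = upgma(names, dist)
print(tree)
```

On this subset, Chimpanzee and Gorilla are joined first, Human next, and Gallus last, giving (Gallus,(Human,(Chimpanzee,Gorilla))). A production analysis would use an established package, as the text does, since this sketch omits branch lengths and tie handling.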
Table VIII

The accession numbers, abbreviations, names, and lengths of the 24 coronavirus genomes

No.  Accession  Abbreviation  Genome                                   Length (bp)
1    NC_002645  HCoV_229E     Human coronavirus 229E                   27317
2    NC_002306  TGEV          Transmissible gastroenteritis virus      28586
3    NC_003436  PEDV          Porcine epidemic diarrhea virus          28033
4    U00735     BCoVM         Bovine coronavirus strain Mebus          31032
5    AF391542   BCoVL         Bovine coronavirus isolate BCoV-LUN      31028
6    AF220295   BCoVQ         Bovine coronavirus strain Quebec         31100
7    NC_003045  BCoV          Bovine coronavirus                       31028
8    AF208067   MHVM          Murine hepatitis virus strain ML-10      31233
9    AF201929   MHV2          Murine hepatitis virus strain 2          31276
10   AF208066   MHVP          Murine hepatitis virus strain Penn 97–1  31112
11   NC_001846  MHV           Murine hepatitis virus strain A59        31357
12   NC_001451  IBV           Avian infectious bronchitis virus        27608
13   AY278488   BJ01          SARS coronavirus BJ01                    29725
14   AY278741   Urbani        SARS coronavirus Urbani                  29727
15   AY278491   HKU-39849     SARS coronavirus HKU-39849               29742
16   AY278554   CUHK-W1       SARS coronavirus CUHK-W1                 29736
17   AY282752   CUHK-Su10     SARS coronavirus CUHK-Su10               29736
18   AY283794   SIN2500       SARS coronavirus Sin2500                 29711
19   AY283795   SIN2677       SARS coronavirus Sin2677                 29705
20   AY283796   SIN2679       SARS coronavirus Sin2679                 29711
21   AY283797   SIN2748       SARS coronavirus Sin2748                 29706
22   AY283798   SIN2774       SARS coronavirus Sin2774                 29711
23   AY291451   TW1           SARS coronavirus TW1                     29729
24   NC_004718  TOR2          SARS coronavirus                         29751
Figure 3

The phylogenetic tree for the 24 coronavirus genomes based on our numerical characterization. The tree is constructed by the UPGMA method.

Figure 4

The phylogenetic tree for the 24 coronavirus genomes by Clustal X.

The topology of the tree obtained by our method (Fig. 3) is on the whole consistent with the established taxonomic groups, except for BCoVM and IBV. Coronaviruses can be divided into four groups according to serotype. Group I (HCoV_229E, TGEV, and PEDV) and group II (BCoVL, BCoVQ, BCoV, MHVM, MHV2, MHVP, and MHV) contain mammalian viruses; group II coronaviruses also contain a hemagglutinin esterase gene homologous to that of Influenza C virus 21. Group III (IBV) contains only avian viruses, and group IV 22, 23 comprises the SARS-CoVs. Comparing Figures 3 and 4, we find some differences. In Figure 4, the result for group IV is not clear. All in all, our method gives a more intuitively acceptable arrangement than Clustal X.

4. Conclusions

Sequence comparison, which aims to discover similarity relationships between molecular sequences, is a fundamental task in computational biology. Currently, it is mainly handled using alignments. With the explosive growth of biological sequence data, alignment methods seem inadequate for postgenomic studies, so other methods are actively pursued. In this article, we proposed a new method to numerically characterize DNA sequences and applied it to analyze the similarity of different sequences. Based on the 3D graphical representation, we calculated curvature and torsion in difference form and used them as new descriptors to numerically characterize the DNA sequences. Because it avoids the complexity of calculating the leading eigenvalues of the matrix for the graphical representation, our method is simpler. Its application to the similarity/dissimilarity analysis of the coding sequences of the β-globin gene of 11 species, and of each exon of the gene, illustrates its validity. In addition, we used our method to analyze coronavirus genomes and construct their phylogenetic tree. The result, which is consistent with previous analyses, shows that the SARS-CoVs form an independent group.
  14 in total

1.  On the characterization of DNA primary sequences by triplet of nucleic acid bases.

Authors:  M Randić; X Guo; S C Basak
Journal:  J Chem Inf Comput Sci       Date:  2001 May-Jun

2.  Characterization of a novel coronavirus associated with severe acute respiratory syndrome.

Authors:  Paul A Rota; M Steven Oberste; Stephan S Monroe; W Allan Nix; Ray Campagnoli; Joseph P Icenogle; Silvia Peñaranda; Bettina Bankamp; Kaija Maher; Min-Hsin Chen; Suxiong Tong; Azaibi Tamin; Luis Lowe; Michael Frace; Joseph L DeRisi; Qi Chen; David Wang; Dean D Erdman; Teresa C T Peret; Cara Burns; Thomas G Ksiazek; Pierre E Rollin; Anthony Sanchez; Stephanie Liffick; Brian Holloway; Josef Limor; Karen McCaustland; Melissa Olsen-Rasmussen; Ron Fouchier; Stephan Günther; Albert D M E Osterhaus; Christian Drosten; Mark A Pallansch; Larry J Anderson; William J Bellini
Journal:  Science       Date:  2003-05-01       Impact factor: 47.728

3.  New 2D graphical representation of DNA sequences.

Authors:  Bo Liao; Tian-Ming Wang
Journal:  J Comput Chem       Date:  2004-08       Impact factor: 3.376

4.  New invariant of DNA sequences.

Authors:  Chun Li; Jun Wang
Journal:  J Chem Inf Model       Date:  2005 Jan-Feb       Impact factor: 4.956

5.  PNN-curve: a new 2D graphical representation of DNA sequences and its application.

Authors:  Xiao Qing Liu; Qi Dai; Zhilong Xiu; Tianming Wang
Journal:  J Theor Biol       Date:  2006-07-24       Impact factor: 2.691

6.  A novel 2D graphical representation of DNA sequences and its application.

Authors:  Qi Dai; Xiaoqing Liu; Tianming Wang
Journal:  J Mol Graph Model       Date:  2006-03-03       Impact factor: 2.518

7.  A new method to analyze the similarity of protein structure using TOPS representations.

Authors:  Ying Guo; Tian-ming Wang
Journal:  J Biomol Struct Dyn       Date:  2008-12

8.  The Burrows-Wheeler similarity distribution between biological sequences based on Burrows-Wheeler transform.

Authors:  Lianping Yang; Xiangde Zhang; Tianming Wang
Journal:  J Theor Biol       Date:  2009-11-10       Impact factor: 2.691

9.  A simple way to look at DNA.

Authors:  M A Gates
Journal:  J Theor Biol       Date:  1986-04-07       Impact factor: 2.691

10.  Random walk and gap plots of DNA sequences.

Authors:  P M Leong; S Morgenthaler
Journal:  Comput Appl Biosci       Date:  1995-10