Coronavirus phylogeny based on triplets of nucleic acids bases.

Bo Liao, Yanshu Liu, Renfa Li, Wen Zhu.

Abstract

We considered the fully overlapping triplets of nucleotide bases and proposed a 2D graphical representation of protein sequences consisting of 20 amino acids and a stop code. Based on this 2D graphical representation, we outlined a new approach to analyze the phylogenetic relationships of coronaviruses by constructing a covariance matrix. The evolutionary distances are obtained through measuring the differences among the two-dimensional curves.
Copyright © 2006 Elsevier B.V. All rights reserved.

Year:  2006        PMID: 32226086      PMCID: PMC7094651          DOI: 10.1016/j.cplett.2006.01.030

Source DB:  PubMed          Journal:  Chem Phys Lett        ISSN: 0009-2614            Impact factor:   2.328


Introduction

The compilation of DNA primary sequence data continues unabated, and the volume of output grows daily. Comparing the primary sequences of different DNA strands remains one of the important aspects of analyzing DNA data banks, and the mathematical analysis of such large volumes of genomic sequence data is one of the challenges for bio-scientists. There are three classes of methods for the analysis of DNA sequences: (i) Alignment [1], [2]. (ii) Matrices: (1) matrices in which an individual entry corresponds to an individual pair of bases [3], [6], [7] and (2) matrices in which entries summarize information on different X–Y pairs of bases [4], [5], [7]. (iii) Graphical representation: a graphical representation of a DNA sequence provides a simple way of viewing, sorting and comparing various gene structures.

Graphical techniques have emerged as a powerful tool for the visualization and analysis of long DNA sequences. They provide useful insights into local and global characteristics and into the occurrences, variations and repetitions of the nucleotides along a sequence, which are not as easily obtained by other methods. In recent years several authors have outlined different graphical representations of DNA sequences in 2D, 3D or 4D [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20]. Based on these graphical representations, several authors have outlined approaches for comparing DNA sequences [21], [22], [23], [24], [25].

All these methods are based on the four-letter alphabet (A, C, G and T, standing for the nucleotide bases adenine, cytosine, guanine and thymine, respectively). We instead consider the fully overlapping triplets of nucleotide bases. Considering triplets rather than individual bases has three advantages: (i) The genetic code consists of triplets (codons) of DNA (or RNA in some viruses) nucleotides. (ii) One can easily find the open reading frame as the longest sequence of triplets that contains no stop codons when read in a single reading frame. (iii) The computation becomes simpler.

In this Letter, we propose a 2D graphical representation of the protein sequences consisting of 20 amino acids and a stop code. Based on this 2D graphical representation, we outline a new approach to analyzing the phylogenetic relationships of coronaviruses. The evolutionary distances are obtained by measuring the differences among the two-dimensional curves. Unlike most existing phylogeny construction methods [26], [27], [28], [29], [30], [31], the proposed method does not require multiple alignment.

2D graphical representation of protein sequences and properties

As is known, the 64 triplets of nucleotide bases correspond to 20 amino acids and a stop code. There are three reading frames, starting at positions 1, 2 and 3, respectively. Using the translate tool, we can obtain three protein sequences consisting of 20 amino acids and a stop code. The 20 amino acids found in proteins can be grouped according to the chemistry of their R groups as in [32]: amino acids A, V, F, P, M, I, L belong to the hydrophobic chemical group; amino acids D, E, K, R belong to the charged chemical group; amino acids S, T, Y, H, C, N, Q, W belong to the polar chemical group; amino acid G belongs to the glycine chemical group. Thus any DNA sequence is transformed into three new sequences defined over the alphabet of the 20 amino acids and the stop code (−).

The rule is as follows. As shown in Fig. 1, we construct a pyrimidine–purine graph on two quadrants of the Cartesian coordinate system, with pyrimidines (T and C) in the first quadrant and purines (A and G) in the fourth quadrant. The unit vectors representing the four alphabets are defined in terms of two parameters: m, a real number, and n, a positive real number that is not a perfect square. In this way a DNA sequence is reduced to a series of nodes P_0, P_1, P_2, …, P_⌊N/3⌋, whose coordinates x_i, y_i (i = 0, 1, 2, …, ⌊N/3⌋, where N is the length of the DNA sequence being studied) satisfy Eqs. (1) and (2).
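The translation and grouping step described above can be sketched as follows. This is a minimal illustration, not the authors' tool: it assumes the standard genetic code (NCBI translation table 1), uses "-" for the stop code, and the function names `translate_frames` and `group_of` are hypothetical.

```python
# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

# Chemical groups exactly as listed in the text.
GROUPS = {
    "hydrophobic": "AVFPMIL",
    "charged": "DEKR",
    "polar": "STYHCNQW",
    "glycine": "G",
}


def translate_frames(dna, stop="-"):
    """Return the three protein strings for the frames starting at positions 1, 2, 3."""
    dna = dna.upper().replace("U", "T")
    frames = []
    for offset in range(3):
        residues = []
        # Step through complete codons only; a trailing partial codon is dropped.
        for pos in range(offset, len(dna) - 2, 3):
            aa = CODON_TABLE.get(dna[pos:pos + 3], "X")  # "X" marks an ambiguous codon
            residues.append(stop if aa == "*" else aa)
        frames.append("".join(residues))
    return frames


def group_of(residue):
    """Return the chemical group name of a residue, or None if it has none (e.g. stop)."""
    for name, members in GROUPS.items():
        if residue in members:
            return name
    return None
```

For instance, `translate_frames("ATGGCCTAA")` yields one string per reading frame, and `group_of` maps each residue in those strings to one of the four chemical groups.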
Fig. 1

Pyrimidine–purine graph.

Here A_i, V_i, F_i, P_i, M_i, I_i, L_i, D_i, E_i, K_i, R_i, S_i, T_i, Y_i, H_i, C_i, N_i, Q_i, W_i, G_i and Ω_i are the cumulative occurrence numbers of A, V, F, P, M, I, L, D, E, K, R, S, T, Y, H, C, N, Q, W, G and − (the stop code), respectively, in the subsequence from the 1st to the ith position of the sequence. The parameters s_k, k = 1, …, 17, are positive real numbers that are not perfect squares, with s_i ≠ s_j for i ≠ j (i, j = 1, …, 17). We define A_0 = V_0 = F_0 = P_0 = M_0 = I_0 = L_0 = D_0 = E_0 = K_0 = R_0 = S_0 = T_0 = Y_0 = H_0 = C_0 = N_0 = Q_0 = W_0 = G_0 = Ω_0 = 0.

We call the resulting set of points the characteristic plot set. The curve that connects the points of the characteristic plot set in order is called the characteristic curve; it is determined by the values of m and n that satisfy the conditions above. In Figs. 2, 3 and 4 we show the curves for SARS with different parameters n and m, where s_1 = 2/3; s_2 = 3/4; s_3 = 4/5; s_4 = 5/6; s_5 = 6/7; s_6 = 7/8; s_7 = 8/9; s_8 = 9/10; s_9 = 10/11; s_10 = 11/12; s_11 = 12/13; s_12 = 13/14; s_13 = 14/15; s_14 = 15/16; s_15 = 16/17; s_16 = 17/18; s_17 = 18/19. Figs. 2, 3 and 4 show that the SARS curves remain similar even when the parameters n and m differ.

Obviously, x_i and y_i are irrational numbers that can be expressed through integers s and k. We suppose Eq. (3); then we have Eq. (4). So, given the x-projection and y-projection of any point P = (x, y) on the sequence, after uniquely determining the integers s and k for each of x and y, the numbers A_i, V_i, F_i, P_i, M_i, I_i, L_i, D_i, E_i, K_i, R_i, S_i, T_i, Y_i, H_i, C_i, N_i, Q_i, W_i, G_i, Ω_i of A, V, F, P, M, I, L, D, E, K, R, S, T, Y, H, C, N, Q, W, G and − (the stop code) from the beginning of the sequence to the point P can be found by solving the linear system (2), (4).
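The construction of the characteristic curve can be sketched as a running sum of per-residue steps. The sketch below rests on several assumptions that the text does not confirm: the four group directions are taken as (m, −√n), (√n, m), (√n, −m) and (m, √n) for the hydrophobic, charged, polar and glycine groups, the stop code shares the glycine direction, and each residue is scaled by s_k with s_0 = 1 and s_k = (k + 1)/(k + 2), which reproduces s_1 = 2/3, …, s_17 = 18/19. The names `step_vectors` and `characteristic_curve` are hypothetical.

```python
import math

# Residue -> index k of its scale factor s_k, grouped by chemical group.
# The pairing of residues with s_k follows the order of the direction table.
GROUP_MEMBERS = {
    "hydrophobic": {"A": 0, "V": 1, "F": 2, "P": 3, "M": 4, "I": 5, "L": 6},
    "charged": {"D": 0, "E": 7, "K": 8, "R": 9},
    "polar": {"S": 0, "T": 10, "Y": 11, "H": 12, "C": 13, "N": 14, "Q": 15, "W": 16},
    "glycine": {"G": 0, "-": 17},
}


def step_vectors(m, n):
    """Return the 21 step vectors (dx, dy), one per residue or stop code."""
    root = math.sqrt(n)
    # ASSUMED base directions; the radical and the signs are not certain from the source.
    base = {
        "hydrophobic": (m, -root),
        "charged": (root, m),
        "polar": (root, -m),
        "glycine": (m, root),
    }
    # s_0 = 1, then s_k = (k + 1) / (k + 2) for k = 1..17.
    scale = [1.0] + [(k + 1) / (k + 2) for k in range(1, 18)]
    vectors = {}
    for group, members in GROUP_MEMBERS.items():
        bx, by = base[group]
        for residue, k in members.items():
            vectors[residue] = (scale[k] * bx, scale[k] * by)
    return vectors


def characteristic_curve(protein, m, n):
    """Return the points P_0, P_1, ... obtained by accumulating the step vectors."""
    vectors = step_vectors(m, n)
    x, y = 0.0, 0.0
    points = [(x, y)]  # P_0 is the origin, since all cumulative counts start at 0
    for residue in protein:
        dx, dy = vectors[residue]
        x, y = x + dx, y + dy
        points.append((x, y))
    return points
```

Accumulating steps in this way is equivalent to weighting each cumulative occurrence count by its residue's step vector, so the same points result from either formulation.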
Fig. 2

SARS corresponding curve with different parameters n and m based on the first reading frame.

Fig. 3

SARS corresponding curve with different parameters n and m based on the second reading frame.

Fig. 4

SARS corresponding curve with different parameters n and m based on the third reading frame.

For a given DNA sequence there are three 2D representations corresponding to it. Using the translate tool, one can obtain three protein sequences consisting of 20 amino acids and a stop code, corresponding to the three reading frames starting at positions 1, 2 and 3. In a single reading frame, the coordinates (x_i, y_i) of the ith amino acid of the protein sequence are determined by Eqs. (1) and (2). □

The vector pointing to the point P_i from the origin O is denoted by r_i. Its components x_i and y_i are calculated by Eqs. (1), (2). Let Δr_i = r_i − r_(i−1); then we have the following.

Property 2. For any i = 1, 2, …, N′, where N′ is the length of the protein sequence corresponding to the studied DNA sequence, the vector Δr_i has only twenty-one possible directions. Furthermore, the length of Δr_i, i.e. |Δr_i|, is always equal to s_k(m^2 + n), with k = 0, 1, …, 17 and s_0 = 1.

Actually, the components of Δr_i, i.e. Δx_i and Δy_i, can be calculated for each possible residue (A, V, F, P, M, I, L, D, E, K, R, S, T, Y, H, C, N, Q, W, G and −) at the ith position of the protein sequence by using Eqs. (1), (2). For example, when the ith residue is A, we find Δx_i = m together with the corresponding Δy_i. This result is independent of the conformation state of the (i − 1)th residue. The two numbers are called the direction of Δr_i. The direction numbers and the length of Δr_i for each possible residue type at the ith position are summarized in Table 2. □

There is no circuit or degeneracy in our two-dimensional graphical representation. We assume that: (1) the number of amino acids forming a circuit is l; (2) the numbers of A, V, F, P, M, I, L, D, E, K, R, S, T, Y, H, C, N, Q, W, G and − (the stop code) in the circuit are a′, v′, f′, p′, m′, i′, l′, d′, e′, k′, r′, s′, t′, y′, h′, c′, n′, q′, w′, g′ and δ′, respectively. So a′ + v′ + f′ + p′ + m′ + i′ + l′ + d′ + e′ + k′ + r′ + s′ + t′ + y′ + h′ + c′ + n′ + q′ + w′ + g′ + δ′ = l. Because these residues form a circuit, the steps they contribute must sum to zero in both x and y, which gives Eqs. (5) and (6). Clearly Eqs. (5), (6) hold if, and only if, a′ = v′ = f′ = p′ = m′ = i′ = l′ = d′ = e′ = k′ = r′ = s′ = t′ = y′ = h′ = c′ = n′ = q′ = w′ = g′ = δ′ = 0. Therefore l = 0, which means no circuit exists in this graphical representation. □

The 2D representation possesses reflection symmetry. Usually the sequence is expressed in the order from 5′ to 3′. Suppose that the 2D representation of the protein sequence is described by (x_i, y_i), i = 0, 1, 2, …, N. Suppose again that the 2D representation of the reverse sequence, i.e. the same sequence read from 3′ to 5′, is described by a second set of coordinates; the two sets are then related by a reflection. □

Phylogenetic tree of coronaviruses

For any DNA sequence, we have three translated protein sequences. For any protein sequence, we have a set of points (x_i, y_i), i = 1, 2, 3, …, N, where N is the length of the sequence. The coordinates of the geometrical center of the points, denoted by x_0 and y_0, may be calculated as follows:

$$x_0 = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad y_0 = \frac{1}{N}\sum_{i=1}^{N} y_i.$$

The elements of the covariance matrix CM of the points are defined as:

$$CM_{xx} = \frac{1}{N}\sum_{i=1}^{N}(x_i - x_0)^2, \quad CM_{xy} = CM_{yx} = \frac{1}{N}\sum_{i=1}^{N}(x_i - x_0)(y_i - y_0), \quad CM_{yy} = \frac{1}{N}\sum_{i=1}^{N}(y_i - y_0)^2.$$

The above four numbers give a quantitative description of a set of points (x_i, y_i), i = 1, 2, …, N, scattered in two-dimensional space. Obviously, the matrix is a real symmetric 2 × 2 one. There is a leading eigenvalue for the matrix CM. Hence each DNA sequence has three geometrical centers and three leading eigenvalues, one per reading frame. In Table 3, we list the geometrical centers and leading eigenvalues for the 24 species of Table 1 under the chosen parameter values.
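The geometrical center, covariance matrix and leading eigenvalue can be computed directly from a list of curve points. The following is a sketch under one stated assumption: the sums are normalized by 1/N (population covariance), which the source text does not confirm. The function names are hypothetical, and the leading eigenvalue uses the closed form for a symmetric 2 × 2 matrix.

```python
import math


def geometric_center(points):
    """Return (x_0, y_0), the mean of the point coordinates."""
    count = len(points)
    x0 = sum(x for x, _ in points) / count
    y0 = sum(y for _, y in points) / count
    return x0, y0


def covariance_matrix(points):
    """Return the symmetric 2 x 2 covariance matrix, ASSUMING 1/N normalization."""
    count = len(points)
    x0, y0 = geometric_center(points)
    cxx = sum((x - x0) ** 2 for x, _ in points) / count
    cyy = sum((y - y0) ** 2 for _, y in points) / count
    cxy = sum((x - x0) * (y - y0) for x, y in points) / count
    return ((cxx, cxy), (cxy, cyy))


def leading_eigenvalue(cm):
    """Return the larger eigenvalue of a symmetric 2 x 2 matrix ((a, b), (b, c))."""
    (a, b), (_, c) = cm
    # Eigenvalues are (a + c)/2 +/- sqrt(((a - c)/2)^2 + b^2); take the '+' branch.
    return (a + c) / 2.0 + math.hypot((a - c) / 2.0, b)
```

Applying these three functions to the curve of each reading frame gives the three centers and three leading eigenvalues per genome described above.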
Table 1

The accession number, abbreviation, name and length for the 24 coronavirus genomes

No. | Accession | Abbreviation | Genome | Length (nt)
1 | NC_002645 | HCoV_229E | Human coronavirus 229E | 27 317
2 | NC_002306 | TGEV | Transmissible gastroenteritis virus | 28 586
3 | NC_003436 | PEDV | Porcine epidemic diarrhea virus | 28 033
4 | U00735 | BCoVM | Bovine coronavirus strain Mebus | 31 032
5 | AF391542 | BCoVL | Bovine coronavirus isolate BCoV-LUN | 31 028
6 | AF220295 | BCoVQ | Bovine coronavirus Quebec | 31 100
7 | NC_003045 | BCoV | Bovine coronavirus | 31 028
8 | AF208067 | MHVM | Murine hepatitis virus strain ML-10 | 31 233
9 | AF101929 | MHV2 | Murine hepatitis virus strain 2 | 31 276
10 | AF208066 | MHVP | Murine hepatitis virus strain Penn 97-1 | 31 112
11 | NC_001846 | MHV | Murine hepatitis virus | 31 357
12 | NC_001451 | IBV | Avian infectious bronchitis virus | 27 608
13 | AY278488 | BJ01 | SARS coronavirus BJ01 | 29 725
14 | AY278741 | Urbani | SARS coronavirus Urbani | 29 727
15 | AY278491 | HKU-39849 | SARS coronavirus HKU-39849 | 29 742
16 | AY278554 | CUHK-W1 | SARS coronavirus CUHK-W1 | 29 736
17 | AY282752 | CUHK-Su10 | SARS coronavirus CUHK-Su10 | 29 736
18 | AY283794 | SIN2500 | SARS coronavirus Sin2500 | 29 711
19 | AY283795 | SIN2677 | SARS coronavirus Sin2677 | 29 705
20 | AY283796 | SIN2679 | SARS coronavirus Sin2679 | 29 711
21 | AY283797 | SIN2748 | SARS coronavirus Sin2748 | 29 706
22 | AY283798 | SIN2774 | SARS coronavirus Sin2774 | 29 711
23 | AY291451 | TW1 | SARS coronavirus TW1 | 29 729
24 | NC_004718 | TOR2 | SARS coronavirus | 29 751
Table 2

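The twenty-one possible directions

Residue | Δx_n | Δy_n | |Δr_n|
A | m | -n | m^2 + n
D | n | m | m^2 + n
S | n | m | m^2 + n
G | m | n | m^2 + n
V | m s_1 | s_1 n | s_1(m^2 + n)
F | m s_2 | s_2 n | s_2(m^2 + n)
P | m s_3 | s_3 n | s_3(m^2 + n)
M | m s_4 | s_4 n | s_4(m^2 + n)
I | m s_5 | s_5 n | s_5(m^2 + n)
L | m s_6 | s_6 n | s_6(m^2 + n)
E | n s_7 | m s_7 | s_7(m^2 + n)
K | n s_8 | m s_8 | s_8(m^2 + n)
R | n s_9 | m s_9 | s_9(m^2 + n)
T | n s_10 | -m s_10 | s_10(m^2 + n)
Y | n s_11 | -m s_11 | s_11(m^2 + n)
H | n s_12 | -m s_12 | s_12(m^2 + n)
C | n s_13 | -m s_13 | s_13(m^2 + n)
N | n s_14 | -m s_14 | s_14(m^2 + n)
Q | n s_15 | -m s_15 | s_15(m^2 + n)
W | n s_16 | -m s_16 | s_16(m^2 + n)
− (stop code) | m s_17 | n s_17 | s_17(m^2 + n)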
Table 3

The geometric centers and three leading eigenvalues for each of the 24 coronavirus genomes

i | x_0^(1) | y_0^(1) | x_0^(2) | y_0^(2) | x_0^(3) | y_0^(3) | λ_1 | λ_2 | λ_3
1 | 2.5692e+003 | −159.0439 | 2.5566e+003 | −342.5873 | 2.6794e+003 | 389.8249 | 2.1520 | 2.2321 | 2.3707
2 | 2.8619e+003 | −230.4309 | 2.8245e+003 | −723.2605 | 2.9971e+003 | 128.9913 | 2.6999 | 2.8393 | 2.9157
3 | 2.8626e+003 | −233.0932 | 2.8231e+003 | −724.5553 | 2.9976e+003 | 130.5104 | 2.7034 | 2.8386 | 2.9178
4 | 2.8602e+003 | −245.6989 | 2.8245e+003 | −743.4898 | 2.9985e+003 | 133.2708 | 2.7056 | 2.8453 | 2.9209
5 | 2.8688e+003 | −294.6379 | 2.8364e+003 | −709.6245 | 3.0012e+003 | 146.3851 | 2.7519 | 2.8561 | 2.9158
6 | 2.6263e+003 | 415.1362 | 2.5204e+003 | −204.5027 | 2.4666e+003 | −516.9428 | 2.2817 | 2.0813 | 2.1269
7 | 2.8773e+003 | −476.9658 | 2.8773e+003 | −476.9658 | 2.9006e+003 | −252.7994 | 2.8910 | 2.8910 | 2.7932
8 | 2.8902e+003 | −446.8927 | 2.8902e+003 | −446.8927 | 2.9139e+003 | −227.7537 | 2.9004 | 2.9004 | 2.8179
9 | 2.8853e+003 | −459.6862 | 3.0344e+003 | 82.5446 | 2.8912e+003 | −273.7115 | 2.9146 | 2.9739 | 2.7829
10 | 2.8582e+003 | −528.7428 | 3.0320e+003 | 34.9426 | 2.8807e+003 | −253.2886 | 2.8697 | 2.9882 | 2.7408
11 | 2.5137e+003 | −415.8854 | 2.6893e+003 | 244.2464 | 2.5817e+003 | −222.8666 | 2.2271 | 2.3287 | 2.1831
12 | 2.7670e+003 | −48.3996 | 2.7276e+003 | −34.7759 | 2.8570e+003 | 524.7574 | 2.4705 | 2.5740 | 2.6849
13 | 2.7255e+003 | −35.7080 | 2.8550e+003 | 526.4976 | 2.7646e+003 | −43.8066 | 2.5698 | 2.6804 | 2.4654
14 | 2.7656e+003 | −45.9837 | 2.7262e+003 | −35.1151 | 2.8557e+003 | 528.0186 | 2.4675 | 2.5711 | 2.6821
15 | 2.7659e+003 | −45.2775 | 2.7260e+003 | −36.4889 | 2.8558e+003 | 530.0127 | 2.4680 | 2.5710 | 2.6828
16 | 2.7656e+003 | −47.8004 | 2.7267e+003 | −33.6628 | 2.8560e+003 | 527.4290 | 2.4680 | 2.5725 | 2.6838
17 | 2.7239e+003 | −35.1426 | 2.8535e+003 | 527.3351 | 2.7632e+003 | −45.2702 | 2.5669 | 2.6777 | 2.4630
18 | 2.7233e+003 | −36.1921 | 2.8529e+003 | 527.2583 | 2.7627e+003 | −45.4289 | 2.5657 | 2.6766 | 2.4620
19 | 2.7239e+003 | −34.4434 | 2.8535e+003 | 527.8162 | 2.7633e+003 | −45.2775 | 2.5667 | 2.6780 | 2.4630
20 | 2.7239e+003 | −35.6707 | 2.8525e+003 | 525.5247 | 2.7621e+003 | −43.2715 | 2.5678 | 2.6737 | 2.4587
21 | 2.7241e+003 | −35.5425 | 2.8535e+003 | 527.2287 | 2.7634e+003 | −45.5734 | 2.5675 | 2.6777 | 2.4636
22 | 2.7647e+003 | −48.0684 | 2.7258e+003 | −35.7184 | 2.8553e+003 | 523.7099 | 2.4661 | 2.5700 | 2.6815
23 | 2.7647e+003 | −47.8421 | 2.7252e+003 | −35.8263 | 2.8547e+003 | 524.8910 | 2.4661 | 2.5692 | 2.6808
24 | 2.6110e+003 | −251.1068 | 2.7585e+003 | 459.3175 | 2.6727e+003 | −97.0235 | 2.3573 | 2.4587 | 2.3322
To facilitate the quantitative comparison of different species in terms of their collective parameters, we introduce a distance scale. Suppose that there are two species i and j, each described by its own parameters, which include the three leading eigenvalues of the matrix CM corresponding to that species. The distance d_ij between the two species is defined from these parameters together with the distance between the geometric centers of the ith and the jth genomes, where M is the total number of genomes (M = 24 here). We thereby obtain a real symmetric M × M matrix D whose elements are d_ij, which is used to reflect the evolutionary distance between species i and j. The clustering tree is constructed using the UPGMA method in the Phylip package (http://evolution.genetics.washington.edu/phylip.html). The final phylogenetic tree is drawn using the Drawgram program in the Phylip package. In Fig. 5, we present the phylogenetic tree of the 24 species.
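Building the matrix D and handing it to Phylip can be sketched as follows. The exact distance formula is not recoverable from the text, so this sketch substitutes a plain Euclidean distance between the vectors of three leading eigenvalues; that choice is an assumption, not the authors' definition. The function names and the labels `row1`, `row2` are hypothetical, and the output follows the square distance-matrix layout that Phylip's distance programs read (a count line, then a 10-character name followed by the distances).

```python
import math


def distance_matrix(features):
    """Return (names, D) where D[a][b] is the Euclidean distance between feature vectors.

    ASSUMPTION: Euclidean distance stands in for the paper's distance scale.
    """
    names = list(features)
    matrix = [[math.dist(features[a], features[b]) for b in names] for a in names]
    return names, matrix


def to_phylip(names, matrix):
    """Format a square distance matrix in Phylip's distance-file layout."""
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, matrix):
        label = name[:10].ljust(10)  # Phylip species names occupy exactly 10 characters
        lines.append(label + " ".join(f"{d:.6f}" for d in row))
    return "\n".join(lines) + "\n"


# Leading eigenvalues of rows i = 1 and i = 2 of Table 3, used purely as sample input.
sample = {
    "row1": (2.1520, 2.2321, 2.3707),
    "row2": (2.6999, 2.8393, 2.9157),
}
names, D = distance_matrix(sample)
phylip_text = to_phylip(names, D)
```

The resulting text can be saved as the input file of Phylip's `neighbor` program, which offers UPGMA as one of its clustering options, and the tree it writes can then be drawn with Drawgram.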
Fig. 5

Phylogenetic tree.

Conclusion

We analyzed DNA sequences by considering the fully overlapping triplets of nucleotide bases. The graphical representation presented here can be recovered mathematically without loss of textual information, and it provides a direct plotting method for denoting DNA sequences without degeneracy. Most existing approaches to phylogenetic inference use multiple alignment of sequences and assume some sort of evolutionary model. The multiple alignment strategy does not work for all types of data (e.g., whole-genome phylogeny), and the evolutionary models may not always be correct. The current two-dimensional graphical representation of DNA sequences provides a different approach to constructing phylogenetic trees. Unlike most existing phylogeny construction methods, the proposed method does not require multiple alignment. Both computational scientists and molecular biologists can also use it to analyze protein sequences efficiently. Graphical representations of protein sequences in 2D, 3D and 4D can be obtained using a transform of the same kind, subject to Eq. (2), where a_i, c_i, g_i and t_i are the cumulative occurrence numbers of A, C, G and T, respectively, in the subsequence from the 1st base to the ith base of the sequence.

1.  On the similarity of DNA primary sequences.

Authors:  M Randić; M Vracko
Journal:  J Chem Inf Comput Sci       Date:  2000 May-Jun

2.  Condensed representation of DNA primary sequences.

Authors:  M Randić
Journal:  J Chem Inf Comput Sci       Date:  2000 Jan-Feb

3.  On 3-D graphical representation of DNA primary sequences and their numerical characterization.

Authors:  M Randić; M Vracko; A Nandy; S C Basak
Journal:  J Chem Inf Comput Sci       Date:  2000 Sep-Oct

4.  Graphical approach to analyzing DNA sequences.

Authors:  Bo Liao; Kequan Ding
Journal:  J Comput Chem       Date:  2005-11-15       Impact factor: 3.376

5.  Two-dimensional graphical representation of DNA sequences and intron-exon discrimination in intron-rich sequences.

Authors:  A Nandy
Journal:  Comput Appl Biosci       Date:  1996-02

6.  Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in hominoidea.

Authors:  H Kishino; M Hasegawa
Journal:  J Mol Evol       Date:  1989-08       Impact factor: 2.395

7.  Simpler DNA sequence representations.

Authors:  M A Gates
Journal:  Nature       Date:  1985 Jul 18-24       Impact factor: 49.962

8.  A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences.

Authors:  M Kimura
Journal:  J Mol Evol       Date:  1980-12       Impact factor: 2.395

9.  Reconstructing evolutionary trees from DNA and protein sequences: paralinear distances.

Authors:  J A Lake
Journal:  Proc Natl Acad Sci U S A       Date:  1994-02-15       Impact factor: 11.205

10.  H curves, a novel method of representation of nucleotide series especially suited for long DNA sequences.

Authors:  E Hamori; J Ruskin
Journal:  J Biol Chem       Date:  1983-01-25       Impact factor: 5.157
