
Coronavirus phylogeny based on 2D graphical representation of DNA sequence.

Abstract

A novel coronavirus has been identified as the cause of the outbreak of severe acute respiratory syndrome (SARS). Previous phylogenetic analyses based on sequence alignments show that SARS-CoVs form a new group distantly related to the other three groups of previously characterized coronaviruses. In this article, a new approach based on the 2D graphical representation of the whole genome sequence is proposed to analyze the phylogenetic relationships of coronaviruses. The evolutionary distances are obtained by measuring the differences among the two-dimensional curves. Copyright 2006 Wiley Periodicals, Inc.


Year: 2006 PMID: 16752365 PMCID: PMC7167161 DOI: 10.1002/jcc.20439

Source DB: PubMed Journal: J Comput Chem ISSN: 0192-8651 Impact factor: 3.376

Introduction

The outbreak of atypical pneumonia, referred to as severe acute respiratory syndrome (SARS), was first identified in Guangdong Province, China, and later spread to several countries. A novel coronavirus was isolated and found to be the cause of SARS. The SARS‐coronavirus is a new member of the order Nidovirales, family Coronaviridae, and genus Coronavirus. Several groups have carried out mutation analyses and phylogenetic analyses of the virus.1, 2, 3, 4, 5, 6 Methods for phylogenetic analysis of biological sequences can be divided into two groups. The algorithms in the first group calculate a matrix representing the distance between each pair of sequences and then transform this matrix into a tree. In the second group, instead of building a tree directly, the tree that best explains the observed sequences under the evolutionary assumption is found by evaluating the fitness of different topologies. For example, Jukes and Cantor,7 Kimura,8 Barry and Hartigan,9 Kishino and Hasegawa,10 and Lake11 proposed various distance measures. Camin and Sokal,12 Eck and Dayhoff,13 Cavalli‐Sforza and Edwards,14 and Fitch15 gave parsimony methods. Felsenstein et al.16, 17, 18 proposed maximum likelihood methods. However, all of these methods require a multiple alignment of the sequences and assume some sort of evolutionary model. In addition to the problems of multiple alignment (computational complexity and inherent ambiguity of the alignment cost criteria), these methods become insufficient for phylogenies based on complete genomes. Multiple alignments become misleading because of gene rearrangement, inversion, transposition, and translocation at the substring level, unequal sequence lengths, etc., and statistical evolutionary models are yet to be suggested for complete genomes. On the other hand, whole genome‐based phylogenetic analyses are appearing because single gene sequences generally do not possess enough information to construct an evolutionary history of organisms.
Factors such as different rates of evolution and horizontal gene transfer make phylogenetic analysis of species using single gene sequences difficult. Mathematical analysis of the large volume of genomic DNA sequence data is one of the challenges for bioscientists. Graphical representation of DNA sequences provides a simple way of viewing, sorting, and comparing various gene structures. In recent years several authors have outlined different graphical representations of DNA sequences based on 2D, 3D, or 4D.19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32 Graphical techniques have emerged as a very powerful tool for the visualization and analysis of long DNA sequences. These techniques provide useful insights into local and global characteristics and into the occurrences, variations, and repetition of the nucleotides along a sequence that are not as easily obtainable by other methods.29, 33 Based on these graphical representations, several authors have outlined approaches for comparing DNA sequences.34, 35, 36, 37, 38 Recently, we presented a new two‐dimensional graphical representation of DNA sequences, which has no circuit or degeneracy.19 Here, a new approach based on the 2D graphical representation of the whole genome sequence is proposed to analyze the phylogenetic relationships of genomes. The evolutionary distances are obtained by measuring the differences among the 2D curves. The examination of the phylogenetic relationships of coronaviruses illustrates the utility of our approach.

2D Graphical Representation of DNA Sequences

As shown in Figure 1, and similarly to Yan's34 method, we construct a pyrimidine–purine graph on two quadrants of the Cartesian coordinate system, with pyrimidines (T and C) in the first quadrant and purines (A and G) in the fourth quadrant. The unit vectors representing the four nucleotides A, G, C, and T are defined in terms of two parameters m and n, where m is a real number and n is a positive real number that is not a perfect square. Using this representation, we reduce a DNA sequence to a series of nodes P_0, P_1, P_2, …, P_N, whose coordinates (x_i, y_i) (i = 0, 1, 2, …, N, where N is the length of the DNA sequence being studied) are determined by a_i, c_i, g_i, and t_i, the cumulative occurrence numbers of A, C, G, and T, respectively, in the subsequence from the first base to the i‐th base of the sequence. We define a_0 = c_0 = g_0 = t_0 = 0.
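The construction above can be sketched in Python. The explicit step vectors are not present in this copy of the text, so the assignment used below, A → (m, −√n), G → (√n, −m), C → (√n, m), T → (m, √n), is an assumption chosen to match the stated quadrants (pyrimidines in the first, purines in the fourth, for m > 0). The function name `characteristic_curve` and the decision to skip non-ACGT symbols are likewise illustrative choices, not part of the original method.

```python
import math


def characteristic_curve(seq, m, n):
    """Return the points P_0 .. P_N of the 2D curve for a DNA sequence.

    ASSUMPTION: step vectors are
        A -> (m, -sqrt(n)),  G -> (sqrt(n), -m),
        C -> (sqrt(n),  m),  T -> (m,  sqrt(n)).
    These are not given in this copy of the article.
    """
    r = math.sqrt(n)
    steps = {"A": (m, -r), "G": (r, -m), "C": (r, m), "T": (m, r)}
    x, y = 0.0, 0.0
    points = [(x, y)]  # P_0 is the origin (all cumulative counts are zero)
    for base in seq.upper():
        if base not in steps:
            continue  # illustrative choice: ignore ambiguity codes such as N
        dx, dy = steps[base]
        x += dx
        y += dy
        points.append((x, y))
    return points


# Example with the parameter values the text reports for Figure 3.
curve = characteristic_curve("ATGC", m=3 / 4, n=1 / 2)
```

Because each point is a running sum of step vectors, the coordinates of P_i depend only on the cumulative base counts up to position i, which is the property the paragraph above describes.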

Figure 1

Pyrimidine–purine graph.

We call the corresponding set of points the characteristic plot set. The curve connecting all points of the characteristic plot set in order is called the characteristic curve; it is determined by the values of m and n that satisfy the above-mentioned conditions. In Figure 2, we show the curves of the chimpanzee sequence for different parameters n and m. Figure 2 shows that the chimpanzee curves remain similar for different values of n and m: they have the same tendency despite their different lengths. In Figure 3, we present the 2D curves for 24 complete coronavirus genomes (see Table 1) with parameters n = 1/2 and m = 3/4 chosen initially by Yan et al.34

Figure 2

Characteristic curves of the chimpanzee sequence for different parameters n and m.

Figure 3

The two‐dimensional curves for 24 complete coronavirus genomes. (A) IBV, BCoV, BCoVL, BCoVM, BCoVQ, and HCoV‐229E. (B) MHV2, MHV, MHVM, MHVP, PEDV, and TGEV. (C) BJ01, CUHK‐Su10, CUHK‐W1, SIN2679, SIN2748, and SIN2774. (D) HKU‐39849, SIN2500, SIN2677, TW1, Urbani, and TOR2. [Color figure can be viewed in the online issue, which is available at www.interscience.wiley.com.]

Table 1

The Accession Number, Abbreviation, Name, and Length for the 24 Coronavirus Genomes.

No.	Accession	Abbreviation	Genome	Length (nt)
1	NC_002645	HCoV_229E	Human coronavirus 229E	27,317
2	NC_002306	TGEV	Transmissible gastroenteritis virus	28,586
3	NC_003436	PEDV	Porcine epidemic diarrhea virus	28,033
4	U00735	BCoVM	Bovine coronavirus strain Mebus	31,032
5	AF391542	BCoVL	Bovine coronavirus isolate BCoV‐LUN	31,028
6	AF220295	BCoVQ	Bovine coronavirus Quebec	31,100
7	NC_003045	BCoV	Bovine coronavirus	31,028
8	AF208067	MHVM	Murine hepatitis virus strain ML‐10	31,233
9	AF101929	MHV2	Murine hepatitis virus strain 2	31,276
10	AF208066	MHVP	Murine hepatitis virus strain Penn 97‐1	31,112
11	NC_001846	MHV	Murine hepatitis virus	31,357
12	NC_001451	IBV	Avian infectious bronchitis virus	27,608
13	AY278488	BJ01	SARS coronavirus BJ01	29,725
14	AY278741	Urbani	SARS coronavirus Urbani	29,727
15	AY278491	HKU‐39849	SARS coronavirus HKU‐39849	29,742
16	AY278554	CUHK‐W1	SARS coronavirus CUHK‐W1	29,736
17	AY282752	CUHK‐Su10	SARS coronavirus CUHK‐Su10	29,736
18	AY283794	SIN2500	SARS coronavirus Sin2500	29,711
19	AY283795	SIN2677	SARS coronavirus Sin2677	29,705
20	AY283796	SIN2679	SARS coronavirus Sin2679	29,711
21	AY283797	SIN2748	SARS coronavirus Sin2748	29,706
22	AY283798	SIN2774	SARS coronavirus Sin2774	29,711
23	AY291451	TW1	SARS coronavirus TW1	29,729
24	NC_004718	TOR2	SARS coronavirus	29,751

Figure 3 shows that the curves of BCoV, BCoVL, BCoVM, and BCoVQ have similar tendencies, as do the curves of MHV2, MHV, MHVM, and MHVP. The curves of the 12 SARS coronavirus genomes (BJ01, CUHK‐Su10, CUHK‐W1, SIN2679, SIN2748, SIN2774, HKU‐39849, SIN2500, SIN2677, TW1, Urbani, and TOR2) also have similar tendencies.

Phylogenetic Tree of Coronaviruses

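For any sequence, we have a set of points (x_i, y_i), i = 1, 2, 3, …, N, where N is the length of the sequence. The coordinates of the geometrical center of the points, denoted by x_0 and y_0, may be calculated as follows:29

$$x_0 = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad y_0 = \frac{1}{N}\sum_{i=1}^{N} y_i.$$

The elements of the covariance matrix CM of the points are defined as

$$CM_{xx} = \frac{1}{N}\sum_{i=1}^{N}(x_i - x_0)^2, \qquad CM_{xy} = CM_{yx} = \frac{1}{N}\sum_{i=1}^{N}(x_i - x_0)(y_i - y_0), \qquad CM_{yy} = \frac{1}{N}\sum_{i=1}^{N}(y_i - y_0)^2.$$

The above four numbers give a quantitative description of how the set of points (x_i, y_i), i = 1, 2, …, N, is scattered in 2D space. The matrix CM is evidently a real symmetric 2 × 2 matrix. Its eigenvalues and associated eigenvectors are defined by

$$CM \cdot EV_{\lambda_k} = \lambda_k \, EV_{\lambda_k}, \qquad k = 1, 2.$$

Corresponding to each eigenvalue λ_k there is an eigenvector EV_{λ_k}. With λ_1 < λ_2, the two eigenvectors are denoted by EV_{λ_1} and EV_{λ_2}, respectively. In Table 2, we list (x_0, y_0) and the eigenvectors for the 24 genomes, computed with the chosen parameters n and m.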
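The quantities in this paragraph can be computed with a few lines of Python. This is a minimal sketch, assuming population (1/N) normalization of the covariance matrix; the function name `center_and_eigenvectors` is hypothetical. The 2 × 2 symmetric eigenproblem is solved in closed form, so no external library is needed. Note that the sign of an eigenvector is arbitrary, so signs may differ from those printed in a table.

```python
import math


def center_and_eigenvectors(points):
    """Geometric center, eigenvalues (lam1 <= lam2) and unit eigenvectors
    of the 2x2 covariance matrix of a list of (x, y) points.

    ASSUMPTION: the covariance uses 1/N normalization.
    """
    count = len(points)
    x0 = sum(p[0] for p in points) / count
    y0 = sum(p[1] for p in points) / count
    cxx = sum((p[0] - x0) ** 2 for p in points) / count
    cyy = sum((p[1] - y0) ** 2 for p in points) / count
    cxy = sum((p[0] - x0) * (p[1] - y0) for p in points) / count

    # Eigenvalues of [[cxx, cxy], [cxy, cyy]] in closed form.
    mean = (cxx + cyy) / 2.0
    half_gap = math.hypot((cxx - cyy) / 2.0, cxy)
    lam1, lam2 = mean - half_gap, mean + half_gap

    # Eigenvector for the smaller eigenvalue lam1.
    if cxy != 0.0:
        vx, vy = cxy, lam1 - cxx  # solves (CM - lam1*I) v = 0
    elif cxx <= cyy:
        vx, vy = 1.0, 0.0  # diagonal matrix, smaller variance along x
    else:
        vx, vy = 0.0, 1.0  # diagonal matrix, smaller variance along y
    norm = math.hypot(vx, vy)
    ev1 = (vx / norm, vy / norm)
    # For a symmetric matrix the second eigenvector is perpendicular.
    ev2 = (-ev1[1], ev1[0])
    return (x0, y0), (lam1, lam2), (ev1, ev2)


# Points on the diagonal y = x: all spread lies along (1, 1)/sqrt(2).
center, eigenvalues, eigenvectors = center_and_eigenvectors(
    [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
)
```

The eigenvector of the larger eigenvalue points along the overall direction of the curve, which is why it is close to the x axis for the long, nearly horizontal genome curves.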

Table 2

The Geometric Center and Two Eigenvectors for each of the 24 Coronavirus Genomes.

i	x_0	y_0	EV^i_λ1	EV^i_λ2
1	8.7251e+003	567.4895	(0.0671,−0.9977)	(−0.9977,−0.0671)
2	9.1181e+003	231.8617	(0.0265,−0.9996)	(−0.9996,−0.0265)
3	9.1658e+003	854.0672	(0.0891,−0.9960)	(−0.9960,−0.0891)
4	9.8471e+003	678.7491	(0.0682,−0.9977)	(−0.9977,−0.0682)
5	9.8494e+003	669.8507	(0.0683,−0.9977)	(−0.9977,−0.0683)
6	9.8708e+003	671.8188	(0.0678,−0.9977)	(−0.9977,−0.0678)
7	9.8504e+003	667.9839	(0.0684,−0.9977)	(−0.9977,−0.0684)
8	1.0225e+004	508.6553	(0.0456,−0.9990)	(−0.9990,−0.0456)
9	1.0217e+004	560.8241	(0.0484,−0.9988)	(−0.9988,−0.0484)
10	1.0166e+004	571.4215	(0.0492,−0.9988)	(−0.9988,−0.0492)
11	1.0266e+004	503.3193	(0.0457,−0.9990)	(−0.9990,−0.0457)
12	8.8359e+003	177.6139	(0.0271,−0.9996)	(−0.9996,−0.0271)
13	9.6653e+003	217.7081	(0.0348,−0.9994)	(−0.9994,−0.0348)
14	9.6644e+003	220.2759	(0.0347,−0.9994)	(−0.9994,−0.0347)
15	9.6693e+003	219.4720	(0.0345,−0.9994)	(−0.9994,−0.0345)
16	9.6690e+003	217.1652	(0.0346,−0.9994)	(−0.9994,−0.0346)
17	9.6687e+003	217.0494	(0.0346,−0.9994)	(−0.9994,−0.0346)
18	9.6602e+003	216.5541	(0.0347,−0.9994)	(−0.9994,−0.0347)
19	9.6587e+003	216.9280	(0.0347,−0.9994)	(−0.9994,−0.0347)
20	9.6601e+003	216.0181	(0.0346,−0.9994)	(−0.9994,−0.0346)
21	9.6583e+003	216.5654	(0.0347,−0.9994)	(−0.9994,−0.0347)
22	9.6601e+003	216.0584	(0.0346,−0.9994)	(−0.9994,−0.0346)
23	9.6656e+003	220.1538	(0.0347,−0.9994)	(−0.9994,−0.0347)
24	9.6724e+003	219.6501	(0.0346,−0.9994)	(−0.9994,−0.0346)

To facilitate the quantitative comparison of different species in terms of their collective parameters, we introduce a distance scale and an angle scale as defined below. Suppose that there are two species i and j with parameters x_0^i, y_0^i, λ_1^i, λ_2^i and x_0^j, y_0^j, λ_1^j, λ_2^j, respectively, where (x_0^i, y_0^i) is the geometrical center of the curve belonging to species i, and λ_1^i, λ_2^i are the two eigenvalues of the matrix CM corresponding to species i. The distance d_ij between the two geometric centers is39

$$d_{ij} = \sqrt{(x_0^i - x_0^j)^2 + (y_0^i - y_0^j)^2}, \qquad i, j = 1, 2, \ldots, M,$$

where d_ij denotes the distance between the geometric centers of the ith and the jth genomes, and M is the total number of genomes (M = 24 here). We thus obtain a real symmetric M × M matrix whose elements are d_ij.

To reflect the differences between the trends of every two 2D curves, the angles between the corresponding eigenvectors of every two genomes are used. Denoting by EV^i_{λ_k} (k = 1, 2) the eigenvector of genome i associated with λ_k^i, the angle between the corresponding eigenvectors of genomes i and j is

$$\tau_{ij}^{k} = \arccos\left(\frac{EV^{i}_{\lambda_k} \cdot EV^{j}_{\lambda_k}}{\left|EV^{i}_{\lambda_k}\right| \left|EV^{j}_{\lambda_k}\right|}\right), \qquad k = 1, 2.$$

The sum of τ_ij^k over k for given i, j,

$$\tau_{ij} = \sum_{k=1}^{2} \tau_{ij}^{k},$$

can be used to reflect the trend information of the eigenvectors involved.

Consequently, two sets of parameters are obtained. The first reflects the difference of center positions, represented by the Euclidean distance between the geometric centers. The second indicates the difference of the trends of the 2D curves, represented by the related eigenvectors. The overall distance D_ij between the species i and j is defined in terms of d_ij and τ_ij. Accordingly, a real symmetric M × M matrix with elements D_ij is obtained and used to reflect the evolutionary distance between the species i and j. The clustering tree is constructed using the UPGMA method in the PHYLIP package (http://evolution.genetics.washington.edu/phylip.html). The final phylogenetic tree is drawn using the DRAWGRAM program in the PHYLIP package. In Figure 4, we present the phylogenetic tree of the 24 species.
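The distance and angle scales described above can be sketched as follows. Two points are assumptions rather than statements of the authors' method: the formula that combines the center distance and the summed angle into the overall distance is not present in this copy of the text, so the product used as the default `combine` function is only a placeholder, and the helper names (`angle`, `overall_distance_matrix`, `to_phylip`) are hypothetical. The output function writes the square distance-matrix layout that PHYLIP distance programs read (taxon count on the first line, then a 10-character name followed by one row of distances).

```python
import math


def angle(u, v):
    """Angle in radians between two 2D vectors."""
    dot = u[0] * v[0] + u[1] * v[1]
    cos_t = dot / (math.hypot(u[0], u[1]) * math.hypot(v[0], v[1]))
    return math.acos(max(-1.0, min(1.0, cos_t)))  # clamp rounding error


def overall_distance_matrix(centers, eigvecs, combine=lambda d, t: d * t):
    """Symmetric matrix of overall distances between genomes.

    centers[i] is (x0, y0); eigvecs[i] is (EV_lambda1, EV_lambda2).
    ASSUMPTION: `combine` (default: product) stands in for the article's
    formula, which is missing from this copy.
    """
    size = len(centers)
    dist = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1, size):
            d = math.hypot(centers[i][0] - centers[j][0],
                           centers[i][1] - centers[j][1])
            tau = sum(angle(eigvecs[i][k], eigvecs[j][k]) for k in range(2))
            dist[i][j] = dist[j][i] = combine(d, tau)
    return dist


def to_phylip(names, dist):
    """Format a square distance matrix in PHYLIP layout."""
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, dist):
        lines.append(f"{name[:10]:<10}" + " ".join(f"{v:.6f}" for v in row))
    return "\n".join(lines) + "\n"


# Two genomes, using the rounded values printed in rows 1 and 2 of Table 2.
centers = [(8725.1, 567.4895), (9118.1, 231.8617)]
eigvecs = [((0.0671, -0.9977), (-0.9977, -0.0671)),
           ((0.0265, -0.9996), (-0.9996, -0.0265))]
matrix_text = to_phylip(["HCoV_229E", "TGEV"],
                        overall_distance_matrix(centers, eigvecs))
```

Because an eigenvector and its negative describe the same direction, the angles are only meaningful if the eigenvectors of all genomes follow a consistent sign convention, as the entries of Table 2 do.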

Figure 4

Phylogenetic tree.

Conclusion

Most existing approaches for phylogenetic inference use multiple alignment of sequences and assume some sort of evolutionary model. The multiple alignment strategy does not work for all types of data (for example, whole genome phylogeny), and the evolutionary models may not always be correct. Our representation provides a direct plotting method to denote DNA sequences without degeneracy. From the DNA graph, the A, T, G, and C usage as well as the original DNA sequence can be recovered mathematically without loss of textual information. The current 2D graphical representation of DNA sequences provides a different approach for constructing the phylogenetic tree. Unlike most existing phylogeny construction methods, the proposed method does not require multiple alignment. Also, both computational scientists and molecular biologists can use it to analyze DNA sequences efficiently with different parameters n and m.

23 in total

1. On the similarity of DNA primary sequences.

Authors: M Randić; M Vracko
Journal: J Chem Inf Comput Sci Date: 2000 May-Jun

2. Indexing scheme and similarity measures for macromolecular sequences.

Authors: C Raychaudhury; A Nandy
Journal: J Chem Inf Comput Sci Date: 1999 Mar-Apr

3. On 3-D graphical representation of DNA primary sequences and their numerical characterization.

Authors: M Randić; M Vracko; A Nandy; S C Basak
Journal: J Chem Inf Comput Sci Date: 2000 Sep-Oct

4. Characterization of a novel coronavirus associated with severe acute respiratory syndrome.

Authors: Paul A Rota; M Steven Oberste; Stephan S Monroe; W Allan Nix; Ray Campagnoli; Joseph P Icenogle; Silvia Peñaranda; Bettina Bankamp; Kaija Maher; Min-Hsin Chen; Suxiong Tong; Azaibi Tamin; Luis Lowe; Michael Frace; Joseph L DeRisi; Qi Chen; David Wang; Dean D Erdman; Teresa C T Peret; Cara Burns; Thomas G Ksiazek; Pierre E Rollin; Anthony Sanchez; Stephanie Liffick; Brian Holloway; Josef Limor; Karen McCaustland; Melissa Olsen-Rasmussen; Ron Fouchier; Stephan Günther; Albert D M E Osterhaus; Christian Drosten; Mark A Pallansch; Larry J Anderson; William J Bellini
Journal: Science Date: 2003-05-01 Impact factor: 47.728

5. A new sequence distance measure for phylogenetic tree construction.

Authors: Hasan H Otu; Khalid Sayood
Journal: Bioinformatics Date: 2003-11-01 Impact factor: 6.937

6. A Hidden Markov Model approach to variation among sites in rate of evolution.

Authors: J Felsenstein; G A Churchill
Journal: Mol Biol Evol Date: 1996-01 Impact factor: 16.240

7. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences.

Authors: M Kimura
Journal: J Mol Evol Date: 1980-12 Impact factor: 2.395

8. Reconstructing evolutionary trees from DNA and protein sequences: paralinear distances.

Authors: J A Lake
Journal: Proc Natl Acad Sci U S A Date: 1994-02-15 Impact factor: 11.205

9. Molecular phylogeny of coronaviruses including human SARS-CoV.

Authors: Lei Gao; Ji Qi; Haibin Wei; Yigang Sun; Bailin Hao
Journal: Chin Sci Bull Date: 2003

10. Bioinformatics analysis of SARS coronavirus genome polymorphism.

Authors: Gordana M Pavlovic-Lazetic; Nenad S Mitic; Milos V Beljanski
Journal: BMC Bioinformatics Date: 2004-05-25 Impact factor: 3.169
