Guisong Chang1, Tianming Wang2. 1. School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, PR China; Department of Mathematics, Northeastern University, Shenyang 110004, PR China. 2. School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, PR China.
Abstract
This research focuses on the analysis of genome sequences. Based on the inter-nucleotide distance sequence, we propose the conditional multinomial distribution profile for the complete genomic sequence. These profiles can be used to define a very simple, computationally efficient, alignment-free distance measure that reflects the evolutionary relationships between genomic sequences. We use this distance measure to classify chromosomes according to species of origin and to build the phylogenetic tree of 24 complete coronavirus genome sequences. Our results demonstrate that the new method is powerful and efficient.
The great volume of available genomic data has made it possible to analyze large sets of organisms at the whole-genome scale. However, given that most genomes contain millions to billions of nucleotides, traditional molecular analysis methods based on multiple sequence alignment become impractical due to their high computational complexity (Vinga and Almeida, 2003). Consequently, considerable efforts have been made to develop alignment-free methods for sequence analysis. The first approach, mainly based on graphical representation of the sequence, is very convenient for studying several selected cases (Liao et al., 2005; Jeffrey, 1990; Nandy, 1994; Nandy and Nandy, 1995; Nandy and Nandy, 2003; Randic and Vracko, 2000; Randic et al., 2001; Randic et al., 2003a; Randic et al., 2003b; Randic et al., 2006; Randic and Balaban, 2003; Randic, 2008; Zhang and Zhang, 1994). One of the aims of graphical representation is to identify regions of interest or the distribution of bases along the sequence visually (Zhang and Zhang, 1994). The second approach, numerical characterization, has been proposed to characterize the DNA sequence (Akhtar et al., 2007). For this purpose one has to find representative descriptors that characterize an abstract mathematical representation of the biological sequence (Dai et al., 2006; He and Wang, 2002). A commonly used numerical characterization of the sequence is to consider binary sequences that describe the position of each nucleotide (Voss, 1992). Different approaches are described in a recent review (Nandy et al., 2006).

Nair and Mahalashmi (2005) proposed the inter-nucleotide distance as a new DNA numerical profile. Any DNA sequence can be converted into a unique numerical sequence of the same length. In this representation, each number is the distance from a nucleotide to the next occurrence of the same nucleotide. Nair and Mahalashmi (2005) also applied the discrete Fourier transform to the inter-nucleotide distance sequence and indicated that this method has a discriminatory capability for highlighting the promoter region of a gene sequence. However, Akhtar and Epps (2008) showed that it has poor accuracy in exon prediction. Afreixo et al. (2009) developed a new method to analyze the inter-nucleotide distance sequence and extracted some interesting features of the DNA sequence. Four per-nucleotide inter-nucleotide distance distributions and a global distance distribution were assigned to each genome sequence. In each per-nucleotide distance distribution, only the total number of the three other nucleotides was considered (Afreixo et al., 2009). In fact, more information about the genome sequence can be extracted from its inter-nucleotide distance sequences.

Motivated by the aforementioned work, we construct four conditional multinomial distributions from the four inter-nucleotide distance sequences. For the inter-nucleotide distance sequence about nucleotide A, the numbers of nucleotides C, G and T between two consecutive occurrences of A follow a multinomial distribution given that the inter-nucleotide distance is k. This multinomial distribution will be called the conditional multinomial distribution. The relative error vector derived from the conditional multinomial distribution can then be used as a genomic signature that identifies each species. This approach allows us to perform comparative analysis between complete genome sequences. In fact, we propose a new representation of evolutionary information, the complete multinomial composition vector (CMCV), built from a collection of multinomial composition vectors. These multinomial composition vectors are built on the relative error vectors of the conditional multinomial distributions with k, where k is within a range. The range of k is determined to ensure that the CMCV contains the largest amount of evolutionary information hidden in the whole genomic data. We then define the evolutionary distance between two genomes based on their complete multinomial composition vectors. The proposed method is tested by phylogenetic analysis on 24 coronavirus genomes. Our results demonstrate that the new method is powerful and efficient.
Materials and methods
Inter-nucleotide distance sequence
A DNA sequence of length n can be viewed as a linear sequence of n symbols from the finite alphabet $\{A, C, G, T\}$. The inter-nucleotide distance was originally introduced by Nair and Mahalashmi (2005). The global inter-nucleotide distance sequence, referred to as GIN, is defined as follows. Given a DNA sequence $S = s_1 s_2 \cdots s_n$,

$$\mathrm{GIN}(m) = k, \qquad m = 1, 2, \ldots, n,$$

where $k$ is the minimum value of $i$ such that $s_{m+i} = s_m$ if such an $i$ exists; else $k = n - m$. As an example, the GIN for the short DNA fragment AGTTCTACCAGC is

$$\mathrm{GIN} = (6, 9, 1, 2, 3, 6, 3, 1, 3, 2, 1, 0).$$

From the global inter-nucleotide distance sequence GIN, we can get the inter-nucleotide distance sequence for each nucleotide $x \in \{A, C, G, T\}$ by keeping the entries at the positions where $x$ occurs. The four inter-nucleotide distance sequences for the same short DNA segment are

$$\mathrm{GIN}_A = (6, 3, 2), \quad \mathrm{GIN}_C = (3, 1, 3, 0), \quad \mathrm{GIN}_G = (9, 1), \quad \mathrm{GIN}_T = (1, 2, 6).$$

A similar inter-nucleotide distance sequence for the nucleotide $x$ was defined by Afreixo et al. (2009), considering that the symbolic sequence is circular. The four inter-nucleotide distance sequences for the short DNA fragment AGTTCTACCAGC are then

$$\mathrm{CIN}_A = (6, 3, 3), \quad \mathrm{CIN}_C = (3, 1, 3, 5), \quad \mathrm{CIN}_G = (9, 3), \quad \mathrm{CIN}_T = (1, 2, 9).$$

The corresponding global distance sequence, referred to as CIN, for the same short DNA segment is

$$\mathrm{CIN} = (6, 9, 1, 2, 3, 9, 3, 1, 3, 3, 3, 5),$$

which is slightly different from the non-circular approach used by Nair and Mahalashmi (2005).
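The two distance definitions can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names `gin`, `cin` and `per_nucleotide` are hypothetical, the non-circular rule falls back to the distance to the sequence end (k = n − m), and the circular rule wraps around to the first occurrence of the same nucleotide.

```python
def gin(seq):
    """Non-circular global inter-nucleotide distances (assumed reading of the definition)."""
    n = len(seq)
    out = []
    for m, base in enumerate(seq):           # m is 0-based here
        nxt = seq.find(base, m + 1)          # next occurrence of the same base
        # distance to the next occurrence, else n - m with 1-based m
        out.append(nxt - m if nxt != -1 else n - (m + 1))
    return out


def cin(seq):
    """Circular variant: the search wraps around to the start of the sequence."""
    doubled = seq + seq                      # reading the sequence twice emulates the circle
    return [doubled.find(base, m + 1) - m for m, base in enumerate(seq)]


def per_nucleotide(seq, distances):
    """Split a global distance sequence into one sequence per nucleotide."""
    return {x: [d for b, d in zip(seq, distances) if b == x] for x in "ACGT"}


if __name__ == "__main__":
    fragment = "AGTTCTACCAGC"
    print(gin(fragment))
    print(cin(fragment))
    print(per_nucleotide(fragment, cin(fragment)))
```

Doubling the string keeps the circular case to a single `find` call per position; for genome-scale input a single backward pass that remembers the last position of each nucleotide would avoid the repeated scans.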
Conditional multinomial distribution
From the definition of the inter-nucleotide distance sequence, we clearly see that only the total number of the three other nucleotides is considered in each inter-nucleotide distance sequence (Afreixo et al., 2009). In fact, we can count the number of each nucleotide in the genome sequence from its inter-nucleotide distance sequence. Consequently, we can derive four conditional multinomial distributions from the corresponding inter-nucleotide distance sequences.

We take the inter-nucleotide distance sequence about nucleotide A ($\mathrm{CIN}_A$) as an example. Considering the case of $\mathrm{CIN}_A = k$, let $p_1$, $p_2$ and $p_3$ be the occurrence probabilities of nucleotides C, G, and T, respectively, between the nearest two nucleotides A. If the nucleotide sequence were generated by an independent and identically distributed (i.i.d.) random process, the numbers of nucleotides C, G and T between the nearest two nucleotides A would follow a multinomial distribution. In fact, the joint probability function of $(X_1, X_2, X_3)$ (the numbers of C, G, and T between the nearest two nucleotides A, respectively, given that $\mathrm{CIN}_A = k$) is

$$P(X_1 = x_1, X_2 = x_2, X_3 = x_3 \mid \mathrm{CIN}_A = k) = \frac{(k-1)!}{x_1!\, x_2!\, x_3!}\, p_1^{x_1} p_2^{x_2} p_3^{x_3}, \qquad (1)$$

where $x_1 + x_2 + x_3 = k - 1$, $0 \le x_i \le k - 1$, $i = 1, 2, 3$. The nucleotide occurrence probability $p_1$ is estimated by the relative frequency $\hat{p}_1 = N_C / ((k-1) N_k)$, where $N_k$ is the number of times that $\mathrm{CIN}_A = k$ and $N_C$ is the total number of C between the nearest two nucleotides A when the inter-nucleotide distance $\mathrm{CIN}_A = k$. The nucleotide occurrence probabilities $p_2$ and $p_3$ can be obtained in a similar way. The term reference conditional multinomial distribution, applied to a DNA sequence, denotes the distribution that the numbers of nucleotides C, G and T would follow, given the inter-nucleotide distance about nucleotide A, if the nucleotides were determined randomly and independently of each other, with probabilities equal to the relative conditional frequencies.
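The estimation step and the reference distribution can be illustrated with a short sketch. It assumes the circular distance convention, and the names `gap_patterns`, `estimate_probs` and `multinomial_pmf` are hypothetical helpers introduced here for illustration only. Each gap between consecutive occurrences of the chosen nucleotide at distance k is summarized as a triple of counts of the three other nucleotides, taken in alphabetical order.

```python
from collections import Counter
from math import factorial


def gap_patterns(seq, base="A", k=3):
    """Count triples (x1, x2, x3) over gaps between consecutive `base`
    occurrences whose circular distance equals k."""
    others = [b for b in "ACGT" if b != base]   # C, G, T when base is A
    doubled = seq + seq                         # circular reading (assumption)
    counts = Counter()
    for m, b in enumerate(seq):
        if b != base:
            continue
        nxt = doubled.find(base, m + 1)
        if nxt - m == k:
            gap = doubled[m + 1:nxt]            # the k - 1 nucleotides in between
            counts[tuple(gap.count(o) for o in others)] += 1
    return counts


def estimate_probs(counts, k):
    """Relative frequency of each other nucleotide among all gap positions."""
    n_k = sum(counts.values())                  # number of gaps with distance k
    if k < 2 or n_k == 0:
        raise ValueError("need k >= 2 and at least one gap of length k")
    totals = [sum(p[i] * c for p, c in counts.items()) for i in range(3)]
    return [t / ((k - 1) * n_k) for t in totals]


def multinomial_pmf(x, probs):
    """Multinomial probability of the count triple x given the probabilities."""
    coef = factorial(sum(x))
    for xi in x:
        coef //= factorial(xi)                  # exact at every step
    prob = float(coef)
    for xi, pi in zip(x, probs):
        prob *= pi ** xi                        # 0.0 ** 0 evaluates to 1.0 in Python
    return prob


if __name__ == "__main__":
    counts = gap_patterns("AGTTCTACCAGC", "A", 3)
    probs = estimate_probs(counts, 3)
    print(counts, probs, multinomial_pmf((1, 1, 0), probs))
```

On the 12-nucleotide fragment used earlier, only two gaps of length 3 exist for A, so the estimates are far too noisy to be meaningful; the sketch is meant to show the bookkeeping, which only becomes informative on chromosome-scale input.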
Complete multinomial composition vector
From the perspective of molecular evolution, the conditional multinomial distribution may reflect the results of both random mutation and selective evolution. Mutations take place randomly at the molecular level, and natural selection shapes the direction of evolution. Many neutral mutations may remain and act as a random background. One should subtract this random background from the simple counting result in order to highlight the contribution of selective evolution (Chang and Wang, 2011; Ding et al., 2010; Gao et al., 2006). In this work, we propose a new conditional multinomial distribution representation that removes the random background by capturing the relative difference between the biological sequence and a sequence generated by an independent random process.

For a fixed k, we can obtain a measured conditional multinomial distribution and a reference conditional multinomial distribution for a certain nucleotide $x \in \{A, C, G, T\}$. For a certain pattern $\alpha = (x_1, x_2, x_3)$ from the conditional multinomial distribution, we define the multinomial composition value $a(\alpha)$ as follows:

$$a(\alpha) = \frac{f(\alpha) - f_0(\alpha)}{f_0(\alpha)}, \qquad (2)$$

where $f(\alpha)$ is the measured relative frequency of the pattern $\alpha$, and $f_0(\alpha)$, the relative frequency of the pattern $\alpha$ from the reference conditional multinomial distribution, can be computed by (1). All these multinomial composition values can be sorted in some order to form a vector $V_x^k(S) = (a_1, a_2, \ldots, a_m)$ for the genome S, where m denotes the total number of patterns under consideration. Moreover, the four vectors $V_A^k(S)$, $V_C^k(S)$, $V_G^k(S)$ and $V_T^k(S)$ are sorted in some order to form a vector $V^k(S)$ that represents the whole genome S. The vector defined by all these multinomial composition values is referred to as the k-order multinomial composition vector (k−MCV).

For a single fixed k, the k-order multinomial composition vector of the whole genome S may lose some evolutionary information. The complete multinomial composition vector (CMCV) of the whole genome is the concatenation of $V^3(S), V^4(S), \ldots, V^k(S)$, denoted by CMCV (S, k), with the intention of using as much genomic information as possible.
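A compact sketch of how a k−MCV and the concatenated CMCV might be assembled is given below. Several choices are assumptions made only for illustration: patterns are enumerated in lexicographic order (the text says only "some order"), distances are circular, the concatenation starts at k = 3, and a composition value is set to 0 whenever the reference frequency is 0 or no gap of length k exists. All function names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import factorial


def patterns_for(k):
    """All (x1, x2, x3) with x1 + x2 + x3 = k - 1, in lexicographic order."""
    return [p for p in product(range(k), repeat=3) if sum(p) == k - 1]


def gap_patterns(seq, base, k):
    """Count triples of the other three bases over gaps of circular length k."""
    others = [b for b in "ACGT" if b != base]
    doubled = seq + seq
    counts = Counter()
    for m, b in enumerate(seq):
        if b == base:
            nxt = doubled.find(base, m + 1)
            if nxt - m == k:
                gap = doubled[m + 1:nxt]
                counts[tuple(gap.count(o) for o in others)] += 1
    return counts


def mcv_one_base(seq, base, k):
    """Relative difference between measured and reference pattern frequencies."""
    counts = gap_patterns(seq, base, k)
    n_k = sum(counts.values())
    pats = patterns_for(k)
    if n_k == 0:                                   # assumption: no data -> zeros
        return [0.0] * len(pats)
    probs = [sum(p[i] * c for p, c in counts.items()) / ((k - 1) * n_k)
             for i in range(3)]
    vec = []
    for p in pats:
        coef = factorial(k - 1) // (factorial(p[0]) * factorial(p[1]) * factorial(p[2]))
        ref = coef * probs[0] ** p[0] * probs[1] ** p[1] * probs[2] ** p[2]
        measured = counts[p] / n_k
        # assumption: a zero reference frequency yields a zero composition value
        vec.append((measured - ref) / ref if ref > 0 else 0.0)
    return vec


def cmcv(seq, k_max, k_min=3):
    """Concatenate the k-MCVs of all four nucleotides for k_min <= k <= k_max."""
    vec = []
    for k in range(k_min, k_max + 1):
        for base in "ACGT":
            vec.extend(mcv_one_base(seq, base, k))
    return vec


if __name__ == "__main__":
    print(mcv_one_base("AGTTCTACCAGC", "A", 3))
    print(len(cmcv("AGTTCTACCAGC", 7)))
```

With this enumeration a distance k contributes (k+1)k/2 patterns per nucleotide, so CMCV (S, 7) has 4 × (6 + 10 + 15 + 21 + 28) = 320 components, which is small enough for pairwise comparison of many genomes.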
Results and discussions
The conditional multinomial distribution profile of chromosomes
We begin with the largest fragments of available DNA sequences, the chromosomes of eukaryotes listed in Table 1. The conditional multinomial distribution profile shown in Fig. 1 corresponds to three different chromosomes of Saccharomyces cerevisiae. In the case of the inter-nucleotide distance sequence $\mathrm{CIN}_A$ (k=5), we first convert the possible values of the count triple (the numbers of C, G and T) into a one-dimensional index in alphabetical order. We then plot the measured conditional multinomial distribution as bars and the reference conditional multinomial distribution as a line. We clearly see a pattern of peaks and valleys that occur at identical locations for all chromosomes of S. cerevisiae. The three other cases are obtained in a similar way. We again see the similarity between the conditional multinomial distribution profiles about a given nucleotide for the various chromosomes.
Table 1
Labels for chromosomes.
No.  Strain name  Accession  Chromosome
1    m14          NT_002582  M. musculus chromosome 14
2    m17          NT_002588  M. musculus chromosome 17
3    MX           NT_003030  M. musculus chromosome X
4    sc3          NC_001135  S. cerevisiae chromosome 3
5    sc5          NC_001137  S. cerevisiae chromosome 5
6    sc9          NC_001141  S. cerevisiae chromosome 9
7    ce1          NC_000965  C. elegans chromosome 1
8    ce2          NC_000966  C. elegans chromosome 2
9    ce3          NC_000967  C. elegans chromosome 3
Fig. 1
Conditional multinomial distribution profile for three different chromosomes of Saccharomyces cerevisiae. The histogram is from the measured conditional multinomial distribution and the line indicates the reference conditional multinomial distribution with parameters estimated from the data (5−MCV).
If we repeat this experiment for the chromosomes of Caenorhabditis elegans, we get the same result. Again, when we plot the conditional multinomial distribution profile about a given nucleotide, we see a pattern of peaks and valleys that occur at identical locations for all chromosomes of C. elegans. We demonstrate this with three chromosomes of C. elegans in Fig. 2. While the pattern of peaks and valleys in the conditional multinomial distribution profile about a given nucleotide is the same for all chromosomes of C. elegans, this pattern is distinctly different from the pattern of peaks and valleys in the S. cerevisiae profile about the same nucleotide.
Fig. 2
Conditional multinomial distribution profile for three different chromosomes of C. elegans. The histogram is from the measured conditional multinomial distribution and the line indicates the reference conditional multinomial distribution with parameters estimated from the data (5−MCV).
Finally, we repeat the experiment for mouse. The result is shown in Fig. 3. Once more we obtain a sequence of peaks and valleys in the conditional multinomial distribution profile about a given nucleotide that is the same for all mouse chromosomes, and this pattern of peaks and valleys is different from the patterns in the S. cerevisiae and C. elegans profiles.
Fig. 3
Conditional multinomial distribution profile for three different chromosomes of Mouse. The histogram is from the measured conditional multinomial distribution and the line indicates the reference conditional multinomial distribution with parameters estimated from the data (5−MCV).
Phylogenetic analysis
The complete multinomial composition vector of each complete genome provides a simple, easily computable signature that identifies each species. The signature can be used in applications where evolutionary relationships need to be deduced from large genomic sequences. Distances between sets of genomic sequences can be obtained without the need for multiple sequence alignment. Phylogenetic trees are generated by feeding the pairwise distance matrix into the UPGMA method in the PHYLIP package (Felsenstein, 1989).

The 2003 outbreak of atypical pneumonia, referred to as severe acute respiratory syndrome (SARS) and caused by SARS coronaviruses (SARS-CoVs), drew more attention to the relationship between SARS-CoVs and the other coronaviruses. The 24 complete coronavirus genomes used in this paper were downloaded from GenBank; 12 are SARS-CoVs and 12 are from other groups of coronaviruses. The name, accession number, abbreviation, and genome length of the 24 genomes are listed in Table 2. Generally, coronaviruses can be classified into three groups according to serotype. Group I and group II contain mammalian viruses, whereas group III contains only avian viruses. Many investigations have attempted to identify the phylogenetic position of SARS-CoVs. However, this is still a controversial topic: alignment-based methods showed that SARS-CoVs are not closely related to any group and form a new group (Marra et al., 2003; Rota et al., 2003); a maximum likelihood tree built from a fragment of the spike protein favored SARS-CoVs clustering with group II (Liò and Goldman, 2004); while an information-based method, which makes use of the whole genome sequences, indicated that SARS-CoVs are close to group I rather than forming a new group (Yang et al., 2005). Based on the complete multinomial composition vector, we build the phylogenetic tree of the 24 coronaviruses listed in Table 2. The phylogenetic tree is built using the UPGMA programs in the PHYLIP package, and the distance matrix is computed using the Euclidean distance (Felsenstein, 1989).
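The distance step can be sketched as follows. This is only an illustrative stand-in and not the PHYLIP program used in the paper: `euclidean`, `distance_matrix` and `upgma` are hypothetical helper names, the clustering is a plain average-linkage (UPGMA) merge loop that returns the tree topology as nested tuples, and branch lengths are omitted.

```python
from math import sqrt


def euclidean(u, v):
    """Euclidean distance between two composition vectors of equal length."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(vectors):
    """Symmetric matrix of pairwise Euclidean distances."""
    return [[euclidean(u, v) for v in vectors] for u in vectors]


def upgma(names, dist):
    """Average-linkage clustering; returns the topology as nested tuples."""
    clusters = {i: (names[i], 1) for i in range(len(names))}   # id -> (subtree, size)
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    next_id = len(names)
    while len(clusters) > 1:
        a, b = min(d, key=d.get)                 # closest pair of clusters
        (ta, na), (tb, nb) = clusters.pop(a), clusters.pop(b)
        for c in clusters:                       # size-weighted average distance
            dac = d.pop((min(a, c), max(a, c)))
            dbc = d.pop((min(b, c), max(b, c)))
            d[(c, next_id)] = (na * dac + nb * dbc) / (na + nb)
        del d[(a, b)]
        clusters[next_id] = ((ta, tb), na + nb)
        next_id += 1
    return next(iter(clusters.values()))[0]


if __name__ == "__main__":
    # toy vectors standing in for CMCVs of three genomes
    toy = {"x": [0.0, 0.0], "y": [0.0, 1.0], "z": [5.0, 5.0]}
    matrix = distance_matrix(list(toy.values()))
    print(upgma(list(toy), matrix))
```

In practice the distance matrix would be written out in the input format of the tree-building program rather than clustered in Python; the loop above is included only to make the averaging rule explicit.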
Table 2
The accession number, abbreviation, name, and length for each of the 24 coronavirus genomes.
No.  Accession  Abbreviation  Genome                                   Length (nt)
1    NC_002645  HCov-229E     Human coronavirus 229E                   27,317
2    NC_002306  TGEV          Transmissible gastroenteritis virus      28,586
3    NC_003436  PEDV          Porcine epidemic diarrhea virus          28,033
4    U00735     BCoVM         Bovine coronavirus strain Mebus          31,032
5    AF391542   BCoVL         Bovine coronavirus isolate BCoV-LUN      31,028
6    AF220295   BCoVQ         Bovine coronavirus Quebec                31,100
7    NC_4030    BCoV          Bovine coronavirus                       31,028
8    AF208067   MHVM          Murine hepatitis virus strain ML-10      31,233
9    AF201929   MHV2          Murine hepatitis virus strain 2          31,276
10   AF208066   MHVP          Murine hepatitis virus strain Penn 97-1  31,112
11   NC_001846  MHV           Murine hepatitis virus                   31,357
12   NC_001451  IBV           Avian infectious bronchitis virus        27,608
13   AY278488   BJ01          SARS coronavirus BJ01                    29,725
14   AY278741   Urbani        SARS coronavirus Urbani                  29,727
15   AY278491   HKU-39849     SARS coronavirus HKU-39849               29,742
16   AY278554   CUHK-W1       SARS coronavirus CUHK-W1                 29,736
17   AY282752   CUHK-Su10     SARS coronavirus CUHK-Su10               29,736
18   AY283794   SIN2500       SARS coronavirus Sin2500                 29,711
19   AY283795   SIN2677       SARS coronavirus Sin2677                 29,705
20   AY283796   SIN2679       SARS coronavirus Sin2679                 29,711
21   AY283797   SIN2748       SARS coronavirus Sin2748                 29,706
22   AY283798   SIN2774       SARS coronavirus Sin2774                 29,711
23   AY291451   TW1           SARS coronavirus TW1                     29,729
24   NC_004718  TOR2          SARS coronavirus                         29,751
Our results, based on analysis of the complete multinomial composition vectors of the 24 coronavirus genomes, differ notably from the previous phylogenetic study using an information-based similarity index (Yang et al., 2005). As can be seen from Fig. 4, our method indicates that SARS-CoVs are not closely related to any of the previously characterized coronaviruses and form a distinct group (group IV). Our results also show that the group II viruses (BCoV, BCoVL, BCoVM, etc.) are grouped in a monophyletic clade. This result is also mainly in accordance with the conclusions from the alignment-based method (Marra et al., 2003; Rota et al., 2003) and the alignment-free method (Liu et al., 2007). Moreover, the Robinson–Foulds distance between our tree and that of Liu et al. (2007) is only 26.
Fig. 4
The NJ tree of 24 complete coronavirus genomes is constructed by CMCV (S, 7).
The selection of k in CMCV (S, k)
The selection of k in CMCV (S, k) is very important for capturing rich evolutionary information from the DNA sequence. In the case of k=1, there is no nucleotide between two adjacent occurrences of the same nucleotide. In the case of k=2, there is only one nucleotide between them. In both cases the measured and reference distributions coincide, so the multinomial composition value of every pattern is zero, and the CMCV does not contain these multinomial composition values. Certainly, a larger value of k gives a vector containing finer evolutionary information. However, many patterns will not occur in the conditional multinomial distribution with a large value of k. From the viewpoint of information theory, some information may be lost and noise will dominate if a large value of k is considered. To determine the upper bound of k, we introduce a scoring scheme to estimate how important a conditional multinomial distribution is.

Scoring scheme: For a fixed k, let $\alpha$ be a pattern in the conditional multinomial distribution, with multinomial composition value $a_i(\alpha)$ in genome i (found in the k−MCV). Define the expected multinomial composition value for pattern $\alpha$ to be the average of all composition values across all whole genomes, denoted by $\bar{a}(\alpha)$, i.e.

$$\bar{a}(\alpha) = \frac{1}{n} \sum_{i=1}^{n} a_i(\alpha),$$

where there are n genomes in the dataset. The standard deviation measures the deviation of a set of values from its expected value by summing up the deviations of each element. Clearly, the higher its value, the more valuable the pattern $\alpha$ is. Thus, we may define a score for the conditional multinomial distribution with a fixed k as a double sum of the deviations of $a_i(\alpha)$ from $\bar{a}(\alpha)$, where the first sum is over all patterns of the conditional multinomial distribution with a fixed k and the second is over the n genomes.

We believe that by considerably extending the basic pattern counting idea and studying the underlying distribution, we are able to discover unusual patterns and automatically distinguish their roles in shaping evolution. In this case, the k−MCV of the conditional multinomial distribution with the largest score might be considered the most representative for the species, rather than an abnormal outlier from a purely statistical analysis.

We list the score of the conditional multinomial distribution with a fixed k (within the range [3,9]) for the dataset of 24 complete coronavirus genomes in Table 3. The score of a CMCV can be defined as the sum of the scores of the k−MCVs involved in the CMCV. From Table 3, it is clear that there is no large difference after the 7−MCV is added to the CMCV. Moreover, we can define the relative ratio of information involved in a conditional multinomial distribution with a fixed k as the ratio of the score of the k−MCV to the score of the CMCV to which the k−MCV is added. From Table 3, we can clearly see that the relative ratio of the 7−MCV is the maximum (839/1408). Therefore, we select CMCV (S,7) to represent the genome S in the phylogenetic analysis of the 24 coronaviruses.
Table 3
Scores for some conditional multinomial distributions.
k         3     4        5         6        7         8         9
Score     1194  26       104       84       839       211       313
SumScore  1194  1220     1324      1408     2247      2458      2771
Ratio     –     26/1194  104/1220  84/1324  839/1408  211/2247  313/2458
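The arithmetic behind the Ratio row can be checked with a short sketch that divides each score by the running sum of the preceding scores, using the values from Table 3. The `score` function is only a guess at the scoring scheme: it assumes squared deviations from the per-pattern mean, which the text does not confirm, and both function names are hypothetical.

```python
def score(values_by_pattern):
    """Sum over patterns and genomes of squared deviations from the pattern mean.
    The squared form is an assumption; the source only speaks of deviations."""
    total = 0.0
    for values in values_by_pattern:             # one list of a_i(alpha) per pattern
        mean = sum(values) / len(values)
        total += sum((v - mean) ** 2 for v in values)
    return total


def relative_ratios(scores):
    """Score of the k-MCV divided by the summed scores of all smaller k."""
    ratios, running = {}, 0
    for k in sorted(scores):
        if running:                              # no ratio for the first k
            ratios[k] = scores[k] / running
        running += scores[k]
    return ratios


if __name__ == "__main__":
    table3 = {3: 1194, 4: 26, 5: 104, 6: 84, 7: 839, 8: 211, 9: 313}
    ratios = relative_ratios(table3)
    for k, r in ratios.items():
        print(k, round(r, 3))
    print("largest ratio at k =", max(ratios, key=ratios.get))
```

The ratio for k = 7 is 839/1408, roughly 0.6, while every other ratio stays below 0.13, which is the numerical basis for choosing CMCV (S, 7).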
Conclusion
The description and comparison of DNA sequences remain important subjects in bioinformatics. DNA sequence databases have accumulated much data on biological evolution over billions of years; consequently, novel concepts and methods are urgently needed to reveal the biological functions encoded in DNA sequences and to investigate the relationships of DNA sequences with biological evolution, cellular function, genetic mechanisms and the occurrence of illness. In this paper, we propose four conditional multinomial distributions, one for each nucleotide, for the complete genome sequence based on the inter-nucleotide distance sequences. From the conditional multinomial distribution profiles of nine chromosomes, we note that the relative error vector between the measured conditional multinomial distribution and the reference conditional multinomial distribution can be used as a genomic signature, thus allowing the comparison of species. Therefore, it is straightforward to generate a phylogenetic tree based on the Euclidean distances between complete multinomial composition vectors.

To test the validity of our method, we select the complete genome sequences of the 24 coronaviruses that were used by Liu et al. (2007). The phylogenetic tree is obtained from the distance matrix using the UPGMA method. Fig. 4 is the phylogenetic tree of the 24 genome sequences based on the distance matrix of the complete multinomial composition vectors. We find that the tree is mainly consistent with the tree constructed by Liu et al. (2007). Fig. 4 also indicates that SARS-CoVs are not closely related to any existing group and form a new group.

Overall, our results highlight that the conditional multinomial distribution profiles are able to extract more information from the genome sequence. This insight can thus guide the development of more powerful measures for sequence comparison, with possible future improvements based on the correlation structure of DNA.
Marco A Marra, Steven J M Jones, Caroline R Astell, Robert A Holt, Angela Brooks-Wilson, Yaron S N Butterfield, Jaswinder Khattra, Jennifer K Asano, Sarah A Barber, Susanna Y Chan, Alison Cloutier, Shaun M Coughlin, Doug Freeman, Noreen Girn, Obi L Griffith, Stephen R Leach, Michael Mayo, Helen McDonald, Stephen B Montgomery, Pawan K Pandoh, Anca S Petrescu, A Gordon Robertson, Jacqueline E Schein, Asim Siddiqui, Duane E Smailus, Jeff M Stott, George S Yang, Francis Plummer, Anton Andonov, Harvey Artsob, Nathalie Bastien, Kathy Bernard, Timothy F Booth, Donnie Bowness, Martin Czub, Michael Drebot, Lisa Fernando, Ramon Flick, Michael Garbutt, Michael Gray, Allen Grolla, Steven Jones, Heinz Feldmann, Adrienne Meyers, Amin Kabani, Yan Li, Susan Normand, Ute Stroher, Graham A Tipples, Shaun Tyler, Robert Vogrig, Diane Ward, Brynn Watson, Robert C Brunham, Mel Krajden, Martin Petric, Danuta M Skowronski, Chris Upton, Rachel L Roper. Science, 2003.
Vera Afreixo, Carlos A C Bastos, Armando J Pinho, Sara P Garcia, Paulo J S G Ferreira. Bioinformatics, 2009.