
Study of LZ-word distribution and its application for sequence comparison.

Qi Dai, Zhaofang Yan, Zhuoxing Shi, Xiaoqing Liu, Yuhua Yao, Pingan He.

Abstract

Lempel-Ziv complexity has been widely used for sequence comparison and achieved promising results, but until now components' distribution in exhaustive history has not been studied. This paper investigated the whole distribution of LZ-words and presented a novel statistical method for sequence comparison. With the components' length in mind, we revised Lempel-Ziv complexity and obtained various sets of LZ-words. Instead of calculating the LZ-words' contents, we defined a series of set operations on LZ-word set to compare biological sequences. In order to assess the effectiveness of the proposed method, we performed two sets of experiments and compared it with alignment-based methods.

Keywords: Lempel–Ziv complexity; Phylogenetic analysis; Set operation; Word set

Year: 2013 PMID: 23876763 PMCID: PMC7094135 DOI: 10.1016/j.jtbi.2013.07.008

Source DB: PubMed Journal: J Theor Biol ISSN: 0022-5193 Impact factor: 2.691

Introduction

With the high-throughput production of gene and protein sequences, the rate at which new sequences are added to the databases has increased exponentially. Such a collection of sequences does not, by itself, increase the scientist's understanding of the biology of organisms. However, comparing new sequences to those with known functions is a key way of understanding the biology of an organism. Many methods have been proposed for sequence comparison, and they can be categorized into two classes. One is alignment-based methods, in which dynamic programming is used to find an optimal alignment by assigning scores to different possible alignments and picking the alignment with the highest score. Several alignment-based algorithms have been proposed, such as global alignment and local alignment, with or without overlap (Gotoh, 1982, Needleman and Wunsch, 1970, Smith and Waterman, 1981, Randic et al., 2011, Randic et al., 2013). Waterman (Waterman, 1995) and Durbin et al. (Durbin et al., 1998) provided comprehensive reviews of these methods. However, the search for optimal solutions using sequence alignment has problems in: (i) the computational load with large biological databases and (ii) the choice of the scoring schemes (Pham and Zuegg, 2004, Vinga and Almeida, 2003). Therefore, research into the second class, alignment-free methods, is necessary to overcome these critical limitations of alignment-based methods. Up to now, many alignment-free methods have been proposed, but they are still at an early stage of development compared with alignment-based methods.
One of the most widely used alignment-free approaches is the statistical model, in which each sequence is first mapped into an m-dimensional vector according to its k-word frequencies, and sequence similarity is then measured by a distance between the corresponding vectors, such as Euclidean distance (Blaisdell, 1986), Pearson's correlation coefficient (Fichant and Gautier, 1987), Kullback–Leibler discrepancy (Wu et al., 2001) or cosine distance (Stuart et al., 2002). Recently, Ewens and Grant (2005) studied probabilistic properties of words in sequences, deduced their exact distributions and evaluated their asymptotic approximations. When the k-words occurring in a biological sequence are characterized by estimated probabilities rather than frequencies, they are more easily described by more complex models such as the Markov model (Pham and Zuegg, 2004, Hao and Qi, 2004, Wu et al., 2001, Apostolico and Denas, 2008), the mixed model (Kantorovitz et al., 2007) and the Bernoulli model (Lu et al., 2008). Graphical representation is another widely used alignment-free method. It provides a simple way to view, sort and compare various gene sequences through their intuitive pictures and patterns. Randic et al. gave a comprehensive review of these methods (Randic, 2013b, Randic et al., 2003). In order to facilitate comparison of different biological sequences, graphical representations are transformed into mathematical objects such as the E matrix (Yao and Wang, 2004, Liao and Wang, 2004, Song and Tang, 2005), the D/D matrix (Li and Wang, 2003, Yao and Wang, 2004, Liao and Wang, 2004, Song and Tang, 2005), the L/L matrix (Randic, 2013a, Li and Wang, 2003, Yao and Wang, 2004, Liao and Wang, 2004, Song and Tang, 2005) and their “high order” matrices (Yao and Wang, 2004, Song and Tang, 2005). Once a matrix is given, matrix invariants are calculated as descriptors of the sequence, such as the average matrix element, the average row sum, the leading eigenvalue and the Wiener number.
For long sequences, however, these methods become less useful because they require complex repetitive computation to get the matrix invariants. Recently, Otu and Sayood introduced Lempel–Ziv (LZ) complexity to compute the distance between two DNA sequences (Otu and Sayood, 2003). Because it is based on exact direct repeats, the LZ complexity works well with the small DNA alphabet. Unlike DNA sequences, protein sequences and RNA secondary structures consist of more complex alphabets and structure information, which poses more of a challenge for LZ complexity. So Bacha and Baurain, Liu and Wang, and Chen and Zhang presented several strategies in which protein sequences or RNA secondary structures were encoded into a new alphabet prior to computation of the LZ complexity (Bacha and Baurain, 2005, Liu and Wang, 2006, Zhang and Chen, 2010, Zhang and Wang, 2010, Chen and Zhang, 2012). Zhang et al. found that the LZ complexity is strongly correlated with sequence length and proposed a normalized LZ complexity for sequence comparison (Zhang et al., 2009). Taking into account a specific kind of inexact copy in the text, Li et al. generalized the LZ complexity and proposed a new sequence distance measure for sequence comparison (Li et al., 2010). Liu et al. introduced relative LZ complexity to depict the complexity relationship between two sequences (Liu et al., 2012). All of the above LZ-based methods have achieved promising results in biological sequence comparison, but they generally placed a heavy emphasis on the number of components in the exhaustive history, so little attention has been paid to the components themselves. In this paper, we used the proposed revised LZ complexity to obtain a series of LZ-words from the exhaustive history of biological sequences. Based on the LZ-word distributions, we constructed a sorted union LZ-word set from which an indicator sequence was obtained. We then calculated numerical characteristics of the indicator sequence to compare biological sequences.
The performance of the proposed method was evaluated by the phylogenetic analysis and comparison with alignment-based method.

Method

LZ-words of DNA sequences

Given a finite alphabet $\mathcal{A}$, let U, V and W be sequences over it. $L(U)$ is the length of the sequence U, $U(i)$ is the i-th element of U, and $U(i, j)$ is the subsequence of U starting at position i and ending at position j. Here, $U(i, j)$ is the null sequence for $i > j$. Concatenating V and W constructs a new sequence U=VW; V is called “a prefix” of U, and U is called “an extension” of V, if there exists an integer i such that $V = U(1, i)$. An extension U=VW of V is reproducible from V, denoted by $V \rightarrow U$, if there exists an integer $p \le L(V)$ such that $W(k) = U(p+k-1)$, for k=1, 2,…, L(W). A non-null sequence U is producible from its prefix $U(1, j)$, denoted by $U(1, j) \Rightarrow U$, if $U(1, j) \rightarrow U(1, L(U)-1)$. For example: […] with p=1. Note that producibility allows for an extra different symbol at the end. Usually, a DNA primary sequence can be taken as a string of the letters A, G, C, and T, which denote the four nucleic acid bases: adenine, guanine, cytosine, and thymine, respectively. Let $S = s_1 s_2 \ldots s_n$ be a DNA sequence. To indicate a substring of S that starts at position i and ends at position j, we write $S(i, j)$, that is, $S(i, j) = s_i s_{i+1} \ldots s_j$ for $i \le j$. Any sequence S can be built using a production process whose ith step is $S(1, h_{i-1}) \Rightarrow S(1, h_i)$, described as follows:
(1) At the beginning, we had a null-sequence, denoted by Ø. We then added the prefix $S(1)$ to Ø to start the new sequence, and added a separator symbol “⁎” after it.
(2) Let Q be the current prefix and R the symbol that follows it in S, and check whether R can be reproduced from the sequence Q. If R cannot be reproduced from the sequence, join Q and R to get a new prefix QR, and add a symbol “⁎” following QR. If R can be reproduced from the sequence, extend R by the next symbol of S and check again whether it can be reproduced from the sequence, and so on. There are two possible cases: if the end of S is reached, the procedure ends with the new prefix QR; otherwise, once R cannot be reproduced from the sequence, we get a new prefix QR and add a symbol “⁎” behind it.
(3) Repeat step (2) until S is produced.
Instead of focusing on the total number of components in the exhaustive history, we analyzed the components themselves.
For convenience, we call a component in the exhaustive history an LZ-word, and all the components in the exhaustive history an LZ-word set. For example, the LZ-words of S=ATGGTCGGTTTC can be obtained through the following steps, where “⁎” is used to separate the decomposition components:
(1) Generate a novel symbol A: Ø+A→A.
(2) Generate a novel symbol T: A+T→AT.
(3) Generate a novel symbol G: AT+G→ATG.
(4) Copy the longest fragment G and generate an additional symbol T, giving GT: ATG+GT→ATGGT.
(5) Generate a novel symbol C: ATGGT+C→ATGGTC.
(6) Copy the longest fragment GGT and generate an additional symbol T, giving GGTT: ATGGTC+GGTT→ATGGTCGGTT.
(7) Copy the longest fragment TC: ATGGTCGGTT+TC→ATGGTCGGTTTC.
The exhaustive history is therefore A⁎T⁎G⁎GT⁎C⁎GGTT⁎TC. A, T, G, GT, C, GGTT and TC are the LZ-words of the sequence S, and {A, T, G, GT, C, GGTT, TC} is the LZ-word set of the sequence S.
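The decomposition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `lz_words` is hypothetical, and it assumes the usual reading of the exhaustive history in which a candidate component is extended for as long as it can be copied from the text that precedes its last symbol (overlapping copies allowed).

```python
def lz_words(seq):
    """Return the exhaustive-history components (LZ-words) of seq, in order."""
    words = []
    i, n = 0, len(seq)          # i is the 0-based start of the current component
    while i < n:
        j = i + 1               # the candidate component is seq[i:j]
        # Grow the candidate while it occurs inside seq[:j-1], i.e. while it
        # can be copied from the text before its own last symbol.
        while j <= n and seq.find(seq[i:j], 0, j - 1) != -1:
            j += 1
        end = min(j, n)         # a final pure copy stops at the end of seq
        words.append(seq[i:end])
        i = end
    return words


print(lz_words("ATGGTCGGTTTC"))  # ['A', 'T', 'G', 'GT', 'C', 'GGTT', 'TC']
```

The repeated `str.find` calls make this quadratic or worse in the sequence length; it is meant to make the rule concrete, not to be fast.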

Revised LZ-words of DNA sequences

The LZ complexity of a sequence is measured by the minimal number of steps required for its synthesis in a certain process. In each step only two operations are allowed: either generating an additional symbol, which ensures the uniqueness of each component, or copying the longest fragment from the part of the sequence already synthesized. When a new decomposition component is generated, it must be checked whether it is copied from the longest fragment of the synthesized sequence. Consequently, the length of the LZ-words inevitably becomes large as the production process goes on. With this problem in mind, we proposed a revised LZ complexity, described as follows:
(1) At the beginning, we had a null-sequence, denoted by Ø. We then added the prefix $S(1)$ to Ø to start the new sequence, and added a separator symbol “⁎” after it.
(2) Let Q be the current prefix and R the symbol that follows it in S, and check whether R can be reproduced from the set of components obtained so far. If R cannot be reproduced from the set, join Q and R to get a new prefix QR, and add a symbol “⁎” following QR. If R can be reproduced from the set, extend R by the next symbol of S and check again whether it can be reproduced from the set, and so on. There are two possible cases: if the end of S is reached, the procedure ends with the new prefix QR; otherwise, once R cannot be reproduced from the set, we get a new prefix QR and add a symbol “⁎” behind it.
(3) Repeat step (2) until S is produced.
Taking the above sequence S=ATGGTCGGTTTC as an example, we obtained its revised LZ-words through the following steps, where “⁎” is used to separate the decomposition components:
(1) Generate a novel symbol A: Ø+A→A.
(2) Generate a novel symbol T: A+T→AT.
(3) Generate a novel symbol G: AT+G→ATG.
(4) Copy the fragment G and generate an additional symbol T: ATG+GT→ATGGT.
(5) Generate a novel symbol C: ATGGT+C→ATGGTC.
(6) Copy the fragment G and generate an additional symbol G: ATGGTC+GG→ATGGTCGG.
(7) Copy the fragment T and generate an additional symbol T: ATGGTCGG+TT→ATGGTCGGTT.
(8) Copy the fragment T and generate an additional symbol C: ATGGTCGGTT+TC→ATGGTCGGTTTC.
The revised exhaustive history is therefore A⁎T⁎G⁎GT⁎C⁎GG⁎TT⁎TC.
A, T, G, GT, C, GG, TT and TC are the revised LZ-words of the sequence S, and {A, T, G, GT, C, GG, TT, TC} is the revised LZ-word set of the sequence S. It is interesting to note that the maximum length of the revised LZ-words is 2, smaller than the maximum length of 4 among the LZ-words (GGTT).
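A short Python sketch may help make the revised rule concrete. It rests on an assumption that should be stated plainly: "reproduced from the set" is read here as "already a member of the set of components obtained so far", which reproduces the worked example for S but is only one possible reading of the text. The name `revised_lz_words` is hypothetical.

```python
def revised_lz_words(seq):
    """Return revised LZ-words of seq, assuming a candidate is 'reproducible'
    when it already belongs to the set of earlier components."""
    words = []
    seen = set()                # components produced so far
    i, n = 0, len(seq)
    while i < n:
        j = i + 1               # the candidate component is seq[i:j]
        # Extend the candidate while it is already a known component.
        while j <= n and seq[i:j] in seen:
            j += 1
        end = min(j, n)         # the last component may be a pure copy
        word = seq[i:end]
        words.append(word)
        seen.add(word)
        i = end
    return words


words = revised_lz_words("ATGGTCGGTTTC")
print(words)                        # ['A', 'T', 'G', 'GT', 'C', 'GG', 'TT', 'TC']
print(max(len(w) for w in words))   # 2
```

Under this reading each component is a previously seen component plus one new symbol, which is why component lengths grow much more slowly than in the original decomposition.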

Operation measure between different revised LZ-word sets

Given a DNA sequence, we can get a revised LZ-word set. Here, we are interested not only in using the revised LZ-word set to numerically characterize biological sequences, but also in facilitating their comparison. There is a large body of literature on word statistics, where a sequence is interpreted as a succession of symbols (Reinert et al., 2000). A k-word is a series of k consecutive letters in a sequence. Word statistical analysis consists of counting occurrences of words and calculating their numerical characteristics. The standard approach for counting k-words in a sequence of length m is to use a sliding window of length k, shifting the frame one base at a time from position 1 to m−k+1. Instead of counting the LZ-words' content, we analyzed the distribution diversity of revised LZ-words and designed an operation measure to compare biological sequences. Given two DNA sequences X and Y, we obtained their revised LZ-word sets, written here as $W_X$ and $W_Y$. We then blended $W_X$ and $W_Y$ to compose another set $W_{XY}$, in which a word occurring in both sets appears twice (as in the example below). According to the length of the revised LZ-words, the set $W_{XY}$ is divided into several mutually exclusive sets. We then lined up the elements of each set in lexicographic order and got an ordered set $\widetilde{W}_{XY}$. For example, if X=ATGCGTCGGTCCACCCACGTA and Y=ATCGGTCTGTTACAGACTACG are two given DNA sequences, we get their $W_X$, $W_Y$ and $\widetilde{W}_{XY}$ sets: $W_X$={A, C, G, T, CA, CC, CT, GT, CAC, GTA, GTC}, $W_Y$={A, C, G, T, AC, AG, CT, GT, ACT, GTC, GTT}, $\widetilde{W}_{XY}$={A, A, C, C, G, G, T, T, AC, AG, CA, CC, CT, CT, GT, GT, ACT, CAC, GTA, GTC, GTC, GTT}. Now we focus on the blend degree of two biological sequences. Given any pair of neighboring elements in $\widetilde{W}_{XY}$, there are two possible cases: if one is from $W_X$ and the other is from $W_Y$, we suppose there is a transition operation between them; otherwise, both come from the same set ($W_X$ or $W_Y$), and we suppose there is an extension operation (—) between them.
Taking the above two sequences X and Y as an example, we first listed all the revised LZ-words of X in one line and all the revised LZ-words of Y in another line, each marked with its own symbol. We then presented all the operations between the two word sets, based on their ordered union set, which is shown in Fig. 1.
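The ordered union set and the two kinds of operation can be sketched as follows. The word lists are the ones printed in the text for X and Y. Everything else is an assumption made for illustration: the function names are hypothetical, elements are ordered by length and then lexicographically, and when the same word occurs in both sets the copy from X is placed first (the text does not say how such ties are ordered, and the choice changes which neighbors count as transitions).

```python
# Revised LZ-word sets for X and Y, copied as printed in the text.
X_WORDS = ["A", "C", "G", "T", "CA", "CC", "CT", "GT", "CAC", "GTA", "GTC"]
Y_WORDS = ["A", "C", "G", "T", "AC", "AG", "CT", "GT", "ACT", "GTC", "GTT"]


def ordered_union(words_x, words_y):
    """Merge both word sets, tagging each word with its source sequence.

    Sort key: word length, then the word itself, then the source label
    (assumed tie-break: 'X' before 'Y').
    """
    tagged = [(w, "X") for w in words_x] + [(w, "Y") for w in words_y]
    return sorted(tagged, key=lambda t: (len(t[0]), t[0], t[1]))


def neighbor_operations(union):
    """Label each pair of neighbors: 'transition' if the two words come from
    different sequences, 'extension' if they come from the same one."""
    return [
        "transition" if a[1] != b[1] else "extension"
        for a, b in zip(union, union[1:])
    ]


union = ordered_union(X_WORDS, Y_WORDS)
ops = neighbor_operations(union)
print([w for w, _ in union])
print(len(ops))  # 21 neighboring pairs for 22 elements
```

The sketch stops at labeling neighbors; it does not compute operation lengths, because the definition of the length is not recoverable from the text.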

Fig. 1

All the transition operations and extension operations between the revised LZ-word sets of X and Y, according to their ordered union set.

It is interesting to note that the transition operations in the operation figure indicate the similarity between the two revised LZ-word sets, and the extension operations imply their diversity. That is to say, the more transition operations there are, the more similar the two sets are. Accordingly, each operation O is given a length L(O). Given an operation length n, we counted the total number of operations of that length in the operation figure. Since L(O) varies from operation to operation, it can be regarded as a discrete random variable. Given the random variable L(O) and a positive integer n, P(L(O)=n) is the probability that L(O) takes the value n. The collection of pairs (n, P(L(O)=n)), for all positive integers n, is the probability distribution of L(O). Taking all the operations in Fig. 1 as an example, the probability distribution of the operations is P(L(O)=1)=11/15, P(L(O)=2)=2/15 and P(L(O)=3)=2/15. Based on the operation distribution function, we calculated its expectation and propose it as an operation measure between two sequences X and Y. This measure, the average length of the operations, depends on both the extension operations and the transition operations. It is important to note that it satisfies only identity and symmetry; it does not satisfy the triangle inequality. So it is only a dissimilarity measure for sequence comparison. We are interested in it for two reasons. First, it provides an opportunity to study the components' distribution, which is, in some ways, more singular than the total number of components in the exhaustive history. Second, it involves the lengths of the operations, because differing lengths of the operations strengthen the effects of the different operations.
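Written out, the expectation that serves as the operation measure takes the following form. The symbol $OM(X,Y)$ is an assumed name for the measure, since the source's own symbol is not available; $L(O)$ and $P(L(O)=n)$ are as in the distribution tables. The second line evaluates it for the Fig. 1 distribution (11/15, 2/15, 2/15).

```latex
\[
  OM(X,Y) \;=\; E\bigl(L(O)\bigr) \;=\; \sum_{n \ge 1} n \, P\bigl(L(O)=n\bigr)
\]
\[
  E\bigl(L(O)\bigr) \;=\; 1\cdot\frac{11}{15} + 2\cdot\frac{2}{15} + 3\cdot\frac{2}{15}
  \;=\; \frac{21}{15} \;=\; 1.4
\]
```

For the example pair X and Y, the average operation length is therefore 1.4.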

Results and discussion

Comparison of component distribution in the exhaustive history between Lempel–Ziv (LZ) complexity and revised Lempel–Ziv complexity

One of the characteristics of the revised Lempel–Ziv complexity is that it checks whether a candidate fragment can be reproduced from the set of components already obtained, instead of from the whole synthesized sequence. To find the difference between the two, we compared their LZ-word distributions. We first compared the components in their exhaustive histories. For example, HCoV-229E is a human coronavirus genome of length 27,317 with accession number NC_002645. With Lempel–Ziv complexity and revised Lempel–Ziv complexity, we got two exhaustive histories. We then deleted all the separator symbols “⁎” in the exhaustive histories and obtained two new deduced sequences, HCoV-229E_LZ and HCoV-229E_RLZ. It is difficult to observe the sequence difference directly, but we can calculate k-word counts of the deduced sequences to assess their difference. Fig. 2 shows the k-word counts of the deduced sequences HCoV-229E_LZ and HCoV-229E_RLZ with k from 1 to 4. Interestingly, the k-word counts of the two deduced sequences are similar in Fig. 2. That is to say, the sequence information retained by the Lempel–Ziv (LZ) complexity and revised Lempel–Ziv complexity operations is similar.
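The k-word counts used in this comparison follow the sliding-window scheme described in the Method section (a window of length k moved one base at a time from position 1 to m−k+1). A minimal sketch, with the hypothetical helper name `kword_counts`:

```python
from collections import Counter


def kword_counts(seq, k):
    """Count every k-word in seq with a sliding window of length k."""
    # There are len(seq) - k + 1 windows; none when k exceeds len(seq).
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


counts = kword_counts("ATGGTCGGTTTC", 2)
print(counts["GT"])          # 2
print(sum(counts.values()))  # 11 windows for m = 12, k = 2
```

Calling it for k = 1, 2, 3 and 4 on each deduced sequence would give the kind of count vectors compared in Fig. 2.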

Fig. 2

The comparison of k-word counts of the deduced sequences HCoV-229E_LZ and HCoV-229E_RLZ with k from 1 to 4.

In statistics, one-way analysis of variance (abbreviated one-way ANOVA) is a technique used to compare the means of two or more samples (using the F distribution). Here, we used a one-way ANOVA to test whether the k-word counts of the deduced sequences HCoV-229E_LZ and HCoV-229E_RLZ differ from each other. The F value obtained by the test indicates whether the means of the groups differ by more than the variation within the groups would explain. We reject the hypothesis of equal means if the test is significant at the 0.05 level. Since the F-value of 0.91 is not significant at the 0.05 level, there is no significant difference between the k-word counts of the deduced sequences HCoV-229E_LZ and HCoV-229E_RLZ. That is to say, the components in the exhaustive histories of Lempel–Ziv complexity and revised Lempel–Ziv complexity are similar. We found that the total number of components in the exhaustive history with the revised Lempel–Ziv (LZ) complexity algorithm is 4491, which is 1026 more than with the Lempel–Ziv (LZ) complexity algorithm. In addition to the components' distribution in the exhaustive history, we also compared the distribution of the lengths of the components. Taking HCoV-229E as an example, Fig. 3 compares the length distributions of the components in the exhaustive histories obtained by Lempel–Ziv complexity and revised Lempel–Ziv complexity. It is interesting to note that there is a great difference between the lengths of the components. The maximum length of the components obtained by Lempel–Ziv complexity is 16, while that of the components obtained by revised Lempel–Ziv complexity is only 10. The most frequent component length obtained by revised Lempel–Ziv complexity is 6, which is 2 smaller than that obtained by Lempel–Ziv complexity.
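For readers who want to see what the F value measures, here is a minimal pure-Python sketch of the one-way ANOVA F statistic. The function name `one_way_anova_f` and the toy numbers are illustrative only; the text does not say how the k-word counts were grouped for the test, so no attempt is made to reproduce the reported value.

```python
def one_way_anova_f(*groups):
    """Return the one-way ANOVA F statistic for two or more groups of numbers."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: spread of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares: spread of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Toy data (not from the paper): two groups whose means differ by 1.
print(one_way_anova_f([1, 2, 3], [2, 3, 4]))  # 1.5
```

An F value is judged against the critical value of the F distribution with (df_between, df_within) degrees of freedom at the chosen level, here 0.05; a value below the critical value means the hypothesis of equal means is not rejected.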

Fig. 3

Comparison of length distribution of the components in the exhaustive history obtained by Lempel–Ziv (LZ) complexity and revised Lempel–Ziv complexity.

The comparison between Lempel–Ziv (LZ) complexity and revised Lempel–Ziv complexity illustrates that both extract similar information from the primary sequences, but the components in the exhaustive history obtained by the revised Lempel–Ziv complexity are clearly shorter than those obtained by Lempel–Ziv complexity. So the revised Lempel–Ziv complexity makes the components easier to handle.

Influence of set splitting methods on operation measure

Given two DNA sequences X and Y, we obtained their revised LZ-word sets with the revised Lempel–Ziv complexity and blended them to compose another set. In order to highlight the influence of the different LZ-word sizes, we divided this set into several mutually exclusive sets according to the length of the revised LZ-words. It is worth noting that the mutually exclusive sets rely heavily on the set splitting method. In order to evaluate the influence of the set splitting methods, we used the operation measure to classify HEV genotypes with step-wise refinement of the set splitting methods. HEV (Hepatitis E virus) is a non-enveloped, positive-sense, single-stranded RNA virus that belongs to the Hepevirus genus in the separate family Hepeviridae (Lu et al., 2006). The genome of HEV is approximately 7.2 kb in length and contains a short 5′ untranslated region (5′ UTR), three overlapped open reading frames (ORF1, ORF2, and ORF3) and a short 3′ UTR. We retrieved a total of 48 full-length HEV genome sequences from NCBI (http://www.ncbi.nlm.nih.gov/). The strain name, accession number, nucleotide length, genotype, abbreviation and country of all HEV genomes (Lu et al., 2006) are given in Table 1. The 48 HEV genomes are distinctly clustered into four genotypes by the traditional classification (Liu et al., 2008).

Table 1

Abbreviation for the strains, accession number, nucleotide length, genotype, acronym and country for each of the 48 complete HEV genomes.

No	Strain name	Accession	Length	Genotype	Abbreviation	Country
1	B1 (Bur-82)	M73218	7207	I	AA	Burma (Rangoon)
2	B2 (Bur-86)	D10330	7194	I	AB	Burma (Rangoon)
3	I2 [Mad-93]	X99441	7194	I	AC	India (Madras)
4	I3	AF076239	7194	I	AD	India (Hyderabad)
5	Np1(TK15/92)	AF051830	7199	I	AE	Nepal (Kathmandu)
6	P2[Abb-2B]	AF185822	7143	I	AF	Pakistan (Abbottabad)
7	Yam-67	AF459438	7206	I	AG	India (Yamuna Nagar)
8	C1(CHT-88)	D11092	7207	I	AH	China (Xinjiang, Hetian)
9	C2(KS2–87)	L25595	7221	I	AI	China (Xinjiang, Kashi)
10	C3(CHT-87)	L08816	7176	I	AJ	China (Xinjiang, Hetian)
11	C4(Uigh179)	D11093	7194	I	AK	China (Xinjiang, Uighur)
12	China Hebei	M94177	7200	I	AR	China (Hebei)
13	P1(Sar-55)	M80581	7138	I	AM	Pakistan (Sargodha)
14	I1(FHF)	X98292	7202	I	AN	India
15	Morocco	AY230202	7212	I	AO	Morocco
16	T3	AY204877	7170	I	AP	Chad
17	M1	M74506	7180	II	BB	Mexico (Telixtac)
18	HE-JA10	AB089824	7262	III	CA	Japan (Tokyo)
19	JKN-Sap	AB074918	7256	III	CB	Japan (Sapporo)
20	JMY-HAW	AB074920	7240	III	CC	Japan (Sapporo)
21	swUS1	AF082843	7207	III	CD	USA
22	US1	AF060668	7202	III	CE	USA (Minnesota)
23	US2	AF060669	7277	III	CF	USA (Tennessee)
24	JBOAR1-Hyo04	AB189070	7247	III	CG	Japan (Hyogo)
25	JDEER-Hyo03L	AB189071	7230	III	CH	Japan (Hyogo)
26	JJT-KAN	AB091394	7218	III	CI	Japan (Kanagawa)
27	JMO-Hyo03L	AB189072	7180	III	CJ	Japan (Hyogo)
28	JRA1	AP003430	7230	III	CK	Japan (Tokyo)
29	JSO-Hyo03L	AB189073	7180	III	CR	Japan (Tokyo)
30	JTH-Hyo03L	AB189074	7180	III	CM	Japan (Tokyo)
31	JYO-Hyo03L	AB189075	7180	III	CN	Japan (Tokyo)
32	swJ570	AB073912	7257	III	CO	Japan (Tochigi)
33	Kyrgyz	AF455784	7239	III	CP	Kyrgyzstan
34	Arkell	AY115488	7255	III	CQ	Canada (Ontario, Guelph)
35	HE-JA1	AB097812	7258	IV	DA	Japan (Hokkaido)
36	HE-JK4	AB099347	7250	IV	DB	Japan (Tochigi)
37	HE-JI4	AB080575	7186	IV	DC	Japan (Tochigi)
38	JAK-Sai	AB074915	7236	IV	DD	Japan (Saitama)
39	JKK-Sap	AB074917	7235	IV	DE	Japan (Sapporo)
40	JSM-Sap95	AB161717	7202	IV	DF	Japan (Hokkaido)
41	JSN-Sap-FH	AB091395	7234	IV	DG	Japan (Hokkaido)
42	JSN-Sap-FH02C	AB200239	7251	IV	DH	Japan (Hokkaido)
43	JTS-Sap02	AB161718	7202	IV	DI	Japan (Hokkaido)
44	JYW-Sap02	AB161719	7202	IV	DJ	Japan (Hokkaido)
45	swJ13–1	AB097811	7258	IV	DK	Japan (Hokkaido)
46	swCH25	AY594199	7270	IV	DR	China (Uighur)
47	T1	AJ272108	7232	IV	DM	China (Beijing)
48	CCC220	AB108537	7193	IV	DN	China (Changchun)

This experiment aims at assessing how well the operation measure performs on classification under step-wise refinement of the set splitting method. Five set splitting methods with step-wise refinement (SSM) were considered, denoted SSM1 to SSM5. In relation to the clustering literature (Handl et al., 2005), Neighbor-joining (Felsenstein, 1989), a classic tree construction algorithm, can be considered a hierarchical method. The resulting cluster trees are shown in Fig. 4.

Fig. 4

Cluster trees of 48 HEV genomes using tree construction algorithm Neighbor-joining based on the proposed operation measure with SSM1, SSM2, SSM3, SSM4, and SSM5.

To evaluate the performance of the operation measure for HEV genotype classification, we counted the number of misplaced HEV genomes against a gold standard. For the classification of HEV genotypes, we took the traditional classification as the gold standard (Lu et al., 2006). The numbers of misplaced HEV genomes for the operation measure with SSM 1, SSM 2, SSM 3, SSM 4, and SSM 5 are 1, 1, 2, 1 and 0, respectively. These results indicate that the finest refinement scheme, SSM 5, gives the best classification, placing all 48 genomes correctly, although the improvement is not monotonic across the intermediate schemes.

Phylogenetic analysis of coronaviruses

Since the outbreak of the atypical pneumonia referred to as severe acute respiratory syndrome (SARS), more attention has been paid to the relationships between the SARS-CoVs and the other coronaviruses, which would be helpful for discovering drugs and developing vaccines against the virus. Generally, coronaviruses can be divided into three groups according to serotypes. Group I and group II contain mammalian viruses, while group II coronaviruses contain a hemagglutinin esterase gene homologous to that of Influenza C virus. Group III contains only avian viruses. Based on the operation measure, we next inferred the phylogenetic relationships of coronaviruses from the complete coronavirus genomes. The 24 complete coronavirus genomes used in this article were downloaded from GenBank, of which 12 are SARS-CoVs and 12 are from other groups of coronaviruses. The accession number, group, abbreviation, name and genome length of the 24 genomes are listed in Table 2. Given a set of biological sequences, their phylogenetic relationship can be obtained through the following main operations: firstly, we construct the LZ-word sets with the revised Lempel–Ziv complexity and calculate the similarity/dissimilarity using the operation measure; secondly, by arranging all the similarity/dissimilarity values into a matrix, we obtain a pair-wise distance matrix; finally, we put the pair-wise distance matrix into the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) program in the PHYLIP package (Felsenstein, 1989). Fig. 5(a) is the phylogenetic tree of the 24 coronavirus genomes obtained using the proposed operation measure with SSM 5.
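The second step of this pipeline, arranging pairwise dissimilarities into a matrix that a tree-building program can read, might look like the sketch below. The function names are hypothetical, and the dissimilarity is passed in as a parameter; the length-difference function used in the usage lines is only a toy stand-in, not the operation measure. The output follows the PHYLIP square distance-matrix layout as commonly documented (number of taxa on the first line, then one row per taxon with the name padded to 10 characters).

```python
def pairwise_matrix(seqs, dissimilarity):
    """Build a symmetric dissimilarity matrix.

    seqs: dict mapping a taxon name to its sequence.
    dissimilarity: any function of two sequences returning a number.
    """
    names = list(seqs)
    n = len(names)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = dissimilarity(seqs[names[i]], seqs[names[j]])
            matrix[i][j] = matrix[j][i] = d   # fill both halves
    return names, matrix


def to_phylip(names, matrix):
    """Format the matrix as a PHYLIP-style square distance matrix."""
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, matrix):
        # Names are truncated/padded to the classic 10-character field.
        lines.append(f"{name[:10]:<10}" + " ".join(f"{d:.4f}" for d in row))
    return "\n".join(lines)


# Toy usage: sequence-length difference stands in for the real measure.
toy = {"a": "AT", "b": "ATGG", "c": "A"}
names, matrix = pairwise_matrix(toy, lambda x, y: abs(len(x) - len(y)))
print(to_phylip(names, matrix))
```

Keeping the dissimilarity as a parameter means the same scaffolding works for any measure that is symmetric and zero on identical inputs.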

Table 2

The accession number, abbreviation, name and length for each of the 24 coronavirus genomes.

No	Accession	Group	Abbreviation	Genome	Length(nt)
1	NC_002645	I	HCoV-229E	Human coronavirus 229E	27,317
2	NC_002306	I	TGEV	Transmissible gastroenteritis virus	28,586
3	NC_002436	I	PEDV	Porcine epidemic diarrhea virus	28,033
4	U00735	II	BCoVM	Bovine coronavirus strain Mebus	31,032
5	AF391542	II	BCoVL	Bovine coronavirus isolate BCoV–LUN	31,028
6	AF220295	II	BCoVQ	Bovine coronavirus strain Quebec	31,100
7	NC_003045	II	BCoV	Bovine coronavirus	31,028
8	AF208067	II	MHVM	Murine hepatitis virus strain ML–10	31,100
9	AF201929	II	MHV2	Murine hepatitis virus strain 2	31,028
10	AF208066	II	MHVP	Murine hepatitis virus strain Penn 97–1	31,233
11	NC_001846	II	MHV	Murine hepatitis virus	31,276
12	NC_001451	III	IBV	Avian infectious bronchitis virus	27,608
13	AY278488	IV	BJ01	SARS coronavirus BJ01	29,725
14	AY278741	IV	Urbani	SARS coronavirus Urbani	29,727
15	AY278491	IV	HKU-39849	SARS coronavirus HKU-39849	29,742
16	AY278554	IV	CUHK-W1	SARS coronavirus CUHK–W1	29,736
17	AY282752	IV	CUHK-Su10	SARS coronavirus CUHK–Su10	29,736
18	AY283794	IV	SIN2500	SARS coronavirus Sin2500	29,711
19	AY283795	IV	SIN2677	SARS coronavirus Sin2677	29,705
20	AY283796	IV	SIN2679	SARS coronavirus Sin2679	29,711
21	AY283797	IV	SIN2748	SARS coronavirus Sin2748	29,706
22	AY283798	IV	SIN2774	SARS coronavirus Sin2774	29,711
23	AY291451	IV	TW1	SARS coronavirus TW1	29,729
24	NC_004718	IV	TOR2	SARS coronavirus	29,751

Fig. 5

Phylogenetic tree of 24 coronavirus genomes based on (a) the proposed operation measure and (b) multiple alignment CLUSTAL X.

Generally, an independent method can be developed to evaluate the accuracy of a phylogenetic tree, or the validity of a phylogenetic tree can be tested by comparing it with authoritative ones. Here, we adopted the former to test the validity of our phylogenetic tree. Both data sets were aligned with the multiple alignment program CLUSTAL X, and the phylogenetic tree constructed from the alignment is presented in Fig. 5(b). Fig. 5(a) shows that our results are quite consistent with the authoritative results (Gu et al., 2004, Zheng et al., 2005) and with those of the multiple alignment in Fig. 5(b) in the following respects. First of all, all SARS-CoVs are grouped in a separate branch, which appears distinct from the other three groups of coronaviruses. Secondly, BCoV, BCoVL, BCoVM, BCoVQ, MHV, MHV2, MHVM, and MHVP are grouped into one branch, which is consistent with the fact that they belong to group II. Thirdly, HCoV-229E, TGEV, and PEDV are closely related to each other, which is consistent with the fact that they belong to group I. Finally, IBV forms a distinct branch within the genus Coronavirus, consistent with its membership of group III. Rota et al. (Rota et al., 2003) found that the overall level of similarity between SARS-CoVs and the other coronaviruses is low. Our tree also reconfirms that SARS-CoVs are not closely related to any previously isolated coronaviruses and form a new group, which indicates that the SARS-CoVs have followed an independent evolutionary path since their divergence from the other coronaviruses.

Conclusion

Sequence comparison is one of the major goals of sequence analysis, and it can serve as evidence of structural and functional conservation, as well as of evolutionary relations among the sequences. Despite the prevalence of alignment-based methods, it is noteworthy that they are computationally intensive and consequently impractical for querying large data sets. Therefore, considerable efforts have been made to seek alternative methods for sequence comparison. This work presented a novel method to compare biological sequences with the revised Lempel–Ziv complexity. Instead of focusing on the total number of components in the exhaustive history, we analyzed the distribution of the components themselves. We then defined transition and extension operations among the revised LZ-word sets and represented them in the operation figure. With the length of the operations in mind, we designed an operation measure to estimate the similarity/dissimilarity of two biological sequences. To assess the effectiveness of the proposed method, two sets of evaluation experiments were carried out, and its performance was further compared with alignment-based methods. The results demonstrate that the proposed method is effective, which highlights the need for LZ-based methods to consider the whole distribution of the components in the exhaustive history. This understanding can then be used to guide the development of more powerful alignment-free methods for biological sequence comparison.

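Probability distribution of the operation length L(O) (general form):

L(O)	L(O)=1	L(O)=2	…	L(O)=n	…
P	P(L(O)=1)	P(L(O)=2)	…	P(L(O)=n)	…

Probability distribution of the operation length L(O) for the operations in Fig. 1:

L(O)	L(O)=1	L(O)=2	L(O)=3
P	11/15	2/15	2/15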

References (24 in total; 8 listed)

1. T J Wu, Y C Hsieh, L A Li. Statistical measures of DNA sequence dissimilarity under Markov chain models of base composition. Biometrics, 2001-06.

2. Gary W Stuart, Karen Moffett, Steve Baker. Integrated gene and species phylogenies from unaligned whole genome protein sequences. Bioinformatics, 2002-01.

3. Susana Vinga, Jonas Almeida. Alignment-free sequence comparison-a review. Bioinformatics, 2003-03-01. (Review)

4. Paul A Rota, M Steven Oberste, Stephan S Monroe, W Allan Nix, Ray Campagnoli, Joseph P Icenogle, Silvia Peñaranda, Bettina Bankamp, Kaija Maher, Min-Hsin Chen, Suxiong Tong, Azaibi Tamin, Luis Lowe, Michael Frace, Joseph L DeRisi, Qi Chen, David Wang, Dean D Erdman, Teresa C T Peret, Cara Burns, Thomas G Ksiazek, Pierre E Rollin, Anthony Sanchez, Stephanie Liffick, Brian Holloway, Josef Limor, Karen McCaustland, Melissa Olsen-Rasmussen, Ron Fouchier, Stephan Günther, Albert D M E Osterhaus, Christian Drosten, Mark A Pallansch, Larry J Anderson, William J Bellini. Characterization of a novel coronavirus associated with severe acute respiratory syndrome. Science, 2003-05-01.

5. Shengli Zhang, Tianming Wang. A complexity-based method to compare RNA secondary structures and its application. J Biomol Struct Dyn, 2010-10.

6. Milan Randić, Jure Zupan, Alexandru T Balaban, Drazen Vikić-Topić, Dejan Plavsić. Graphical representation of proteins. Chem Rev, 2010-10-12. (Review)

7. Ling Lu, Chunhua Li, Curt H Hagedorn. Phylogenetic analysis of global hepatitis E virus sequences: genetic diversity, subtypes and zoonosis. Rev Med Virol, 2006 Jan-Feb. (Review)

8. Zhihua Liu, Jihong Meng, Xiao Sun. A novel feature-based method for whole genome phylogenetic analysis without alignment: application to HEV genotyping and subtyping. Biochem Biophys Res Commun, 2008-01-28.