
Study of LZ-word distribution and its application for sequence comparison.

Qi Dai, Zhaofang Yan, Zhuoxing Shi, Xiaoqing Liu, Yuhua Yao, Pingan He.

Abstract

Lempel-Ziv complexity has been widely used for sequence comparison and achieved promising results, but until now components' distribution in exhaustive history has not been studied. This paper investigated the whole distribution of LZ-words and presented a novel statistical method for sequence comparison. With the components' length in mind, we revised Lempel-Ziv complexity and obtained various sets of LZ-words. Instead of calculating the LZ-words' contents, we defined a series of set operations on LZ-word set to compare biological sequences. In order to assess the effectiveness of the proposed method, we performed two sets of experiments and compared it with alignment-based methods.
Copyright © 2013 Elsevier Ltd. All rights reserved.

Keywords:  Lempel–Ziv complexity; Phylogenetic analysis; Set operation; Word set

Year:  2013        PMID: 23876763      PMCID: PMC7094135          DOI: 10.1016/j.jtbi.2013.07.008

Source DB:  PubMed          Journal:  J Theor Biol        ISSN: 0022-5193            Impact factor:   2.691


Introduction

With high-throughput production of gene and protein sequences, the rate at which new sequences are added to the databases has increased exponentially. Such a collection of sequences does not, by itself, increase the scientist's understanding of the biology of organisms. However, comparing new sequences to those with known functions is a key way of understanding the biology of an organism. Many methods have been proposed for sequence comparison, and they can be categorized into two classes. The first class is alignment-based methods, in which dynamic programming is used to find an optimal alignment by assigning scores to different possible alignments and picking the alignment with the highest score. Several alignment-based algorithms have been proposed, such as global alignment and local alignment, with or without overlap (Gotoh, 1982, Needleman and Wunsch, 1970, Smith and Waterman, 1981, Randic et al., 2011, Randic et al., 2013). Waterman (1995) and Durbin et al. (1998) provided comprehensive reviews of these methods. However, the search for optimal solutions using sequence alignment has two problems: (i) the computational load with large biological databases and (ii) the choice of the scoring scheme (Pham and Zuegg, 2004, Vinga and Almeida, 2003). Research into the second class, alignment-free methods, has therefore emerged to overcome these limitations of alignment-based methods. Many alignment-free methods have been proposed so far, but they are still at an early stage of development compared with alignment-based methods.
One of the most widely used alignment-free approaches is the statistical model, in which each sequence is first mapped into an m-dimensional vector according to its k-word frequencies; sequence similarity is then measured by a distance between the corresponding vectors, such as the Euclidean distance (Blaisdell, 1986), Pearson's correlation coefficient (Fichant and Gautier, 1987), the Kullback–Leibler discrepancy (Wu et al., 2001) or the cosine distance (Stuart et al., 2002). Ewens and Grant (2005) studied the probabilistic properties of words in sequences, deduced their exact distributions and evaluated their asymptotic approximations. When the k-words occurring in a biological sequence are described by estimated probabilities rather than frequencies, more complex models are used, such as the Markov model (Pham and Zuegg, 2004, Hao and Qi, 2004, Wu et al., 2001, Apostolico and Denas, 2008), the mixed model (Kantorovitz et al., 2007) and the Bernoulli model (Lu et al., 2008). Graphical representation is another widely used alignment-free method. It provides a simple way to view, sort and compare gene sequences through intuitive pictures and patterns. Randic et al. gave comprehensive reviews of these methods (Randic, 2013b, Randic et al., 2003). To facilitate comparison of different biological sequences, graphical representations were transformed into mathematical objects such as the E matrix (Yao and Wang, 2004, Liao and Wang, 2004, Song and Tang, 2005), the D/D matrix (Li and Wang, 2003, Yao and Wang, 2004, Liao and Wang, 2004, Song and Tang, 2005), the L/L matrix (Randic, 2013a, Li and Wang, 2003, Yao and Wang, 2004, Liao and Wang, 2004, Song and Tang, 2005) and their “high order” matrices (Yao and Wang, 2004, Song and Tang, 2005). Once a matrix was given, matrix invariants were calculated as descriptors of the sequence, such as the average matrix element, the average row sum, the leading eigenvalue and the Wiener number.
For long sequences, however, these methods become less useful because computing the matrix invariants requires complex, repetitive computation. Otu and Sayood introduced Lempel–Ziv (LZ) complexity to compute the distance between two DNA sequences (Otu and Sayood, 2003). Because it is based on exact direct repeats, the LZ complexity works well with the small DNA alphabet. Unlike DNA sequences, protein sequences and RNA secondary structures involve larger alphabets and structural information, which poses more of a challenge for LZ complexity. Bacha and Baurain, Liu and Wang, and Chen and Zhang therefore presented several strategies in which protein sequences or RNA secondary structures were encoded into a new alphabet prior to computation of the LZ complexity (Bacha and Baurain, 2005, Liu and Wang, 2006, Zhang and Chen, 2010, Zhang and Wang, 2010, Chen and Zhang, 2012). Zhang et al. found that the LZ complexity is strongly correlated with sequence length and proposed a normalized LZ complexity for sequence comparison (Zhang et al., 2009). Taking into account a specific kind of inexact copy in the text, Li et al. generalized the LZ complexity and proposed a new distance measure for sequence comparison (Li et al., 2010). Liu et al. introduced relative LZ complexity to depict the complexity relationship between two sequences (Liu et al., 2012). All of the above LZ-based methods have achieved promising results in biological sequence comparison, but they place a heavy emphasis on the number of components in the exhaustive history, and little attention has been paid to the components themselves. In this paper, we used the proposed revised LZ complexity to obtain a series of LZ-words from the exhaustive history of biological sequences. Based on the LZ-word distributions, we constructed a sorted union LZ-word set from which an indicator sequence was obtained. We then calculated numerical characteristics of the indicator sequence to compare biological sequences.
The performance of the proposed method was evaluated by phylogenetic analysis and by comparison with an alignment-based method.

Method

LZ-words of DNA sequences

Given a finite alphabet, let U, V and W be sequences over it. L(U) is the length of the sequence U, U(i) is the i-th element of U, and U(i, j) is the subsequence of U starting at position i and ending at position j. Here, U(i, j) is the null sequence for i > j. Concatenating V and W constructs a new sequence U = VW; V is called a prefix of U, and U is called an extension of V, if there exists an integer i such that V = U(1, i). An extension U = VW of V is reproducible from V, denoted by V → U, if there exists an integer p ≤ L(V) such that W(k) = U(p + k − 1) for k = 1, 2, …, L(W). A non-null sequence U is producible from its prefix U(1, j), denoted by U(1, j) ⇒ U, if U(1, j) → U(1, L(U) − 1). Note that producibility allows for an extra, different symbol at the end. Usually, a DNA primary sequence can be taken as a string of the letters A, G, C and T, which denote the four nucleic acid bases adenine, guanine, cytosine and thymine, respectively. Let S = s1 s2 … sn be a DNA sequence. To indicate the substring of S that starts at position i and ends at position j, we write S(i, j), that is, S(i, j) = si … sj for i ≤ j. Any sequence S can be built by a production process in which, at each step, the prefix produced so far is extended by one producible component. The process is described as follows:
(1) At the beginning, we have a null sequence, denoted by Ø. We add the first symbol of S to it as a prefix and place the separator “⁎” after it.
(2) Let Q be the current prefix and R the symbol of S that follows Q. Check whether R can be reproduced from the sequence obtained from QR by deleting its last symbol. If R cannot be reproduced, join Q and R to get the new prefix QR and add the separator “⁎” after QR. If R can be reproduced, extend R by the next symbol of S and check again, and so on. There are two possible cases: if the end of S is reached, the procedure ends with the new prefix QR = S; otherwise, once R can no longer be reproduced from the sequence, we get the new prefix QR and add the separator “⁎” after it.
(3) Repeat step (2) until S is produced.
Instead of focusing on the total number of components in the exhaustive history, we analyzed the components themselves.
For convenience, we call a component in the exhaustive history an LZ-word, and the collection of all components in the exhaustive history an LZ-word set. For example, the LZ-words of S=ATGGTCGGTTTC are obtained through the following steps, where “⁎” is used to separate the decomposition components:
(1) Generate a novel symbol A: Ø+A→A.
(2) Generate a novel symbol T: A+T→AT.
(3) Generate a novel symbol G: AT+G→ATG.
(4) Copy the longest fragment and generate an additional symbol, GT: ATG+GT→ATGGT.
(5) Generate a novel symbol C: ATGGT+C→ATGGTC.
(6) Copy the longest fragment and generate an additional symbol, GGTT: ATGGTC+GGTT→ATGGTCGGTT.
(7) Copy the longest fragment TC: ATGGTCGGTT+TC→ATGGTCGGTTTC.
The resulting exhaustive history is A⁎T⁎G⁎GT⁎C⁎GGTT⁎TC. A, T, G, GT, C, GGTT and TC are the LZ-words of the sequence S, and {A, T, G, GT, C, GGTT, TC} is the LZ-word set of S.
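The decomposition above can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual reading of the exhaustive history in which each component is the shortest substring that cannot be copied from the text preceding its last symbol; the function name `lz_words` is hypothetical and not taken from the paper.

```python
def lz_words(s):
    """Return the exhaustive-history components (LZ-words) of s.

    Assumption: a candidate s[start:end] can be "reproduced" if it occurs
    somewhere in s[:end - 1], i.e. in the text before its last symbol
    (overlapping copies are allowed).
    """
    words = []
    start, n = 0, len(s)
    while start < n:
        end = start + 1
        # Grow the candidate while it is still a copy of earlier text.
        while end <= n and s[start:end] in s[:end - 1]:
            end += 1
        # Copied fragment plus one new symbol, or the remaining tail
        # when the end of s is reached first (slicing truncates at n).
        words.append(s[start:end])
        start = end
    return words


print(lz_words("ATGGTCGGTTTC"))
# ['A', 'T', 'G', 'GT', 'C', 'GGTT', 'TC']
```

The substring test makes this sketch quadratic or worse in the sequence length, which is acceptable for a worked example but not for whole genomes.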

Revised LZ-words of DNA sequences

The LZ complexity of a sequence is measured by the minimal number of steps required for its synthesis in a certain process. At each step only two operations are allowed: either generating an additional symbol, which ensures the uniqueness of each component, or copying the longest fragment from the part of the sequence that has already been synthesized. When a new decomposition component is generated, it must be checked whether it can be copied from the longest fragment of the already synthesized part. Consequently, the length of the LZ-words inevitably grows as the production process goes on. With this problem in mind, we proposed a revised LZ complexity, described as follows:
(1) At the beginning, we have a null sequence, denoted by Ø. We add the first symbol of S to it as a prefix and place the separator “⁎” after it.
(2) Let Q be the current prefix and R the symbol of S that follows Q. Check whether R can be reproduced from the set of LZ-words obtained so far. If R cannot be reproduced from the set, join Q and R to get the new prefix QR and add the separator “⁎” after QR. If R can be reproduced from the set, extend R by the next symbol of S and check again whether it can be reproduced from the set, and so on. There are two possible cases: if the end of S is reached, the procedure ends with the new prefix QR = S; otherwise, once R can no longer be reproduced from the set, we get the new prefix QR and add the separator “⁎” after it.
(3) Repeat step (2) until S is produced.
Taking the above sequence S=ATGGTCGGTTTC as an example, we obtained its revised LZ-words through the following steps, where “⁎” is used to separate the decomposition components:
(1) Generate a novel symbol A: Ø+A→A.
(2) Generate a novel symbol T: A+T→AT.
(3) Generate a novel symbol G: AT+G→ATG.
(4) Copy the fragment G and generate an additional symbol T: ATG+GT→ATGGT.
(5) Generate a novel symbol C: ATGGT+C→ATGGTC.
(6) Copy the fragment G and generate an additional symbol G: ATGGTC+GG→ATGGTCGG.
(7) Copy the fragment T and generate an additional symbol T: ATGGTCGG+TT→ATGGTCGGTT.
(8) Copy the fragment T and generate an additional symbol C: ATGGTCGGTT+TC→ATGGTCGGTTTC.
The resulting exhaustive history is A⁎T⁎G⁎GT⁎C⁎GG⁎TT⁎TC.
A, T, G, GT, C, GG, TT and TC are the revised LZ-words of the sequence S, and {A, T, G, GT, C, GG, TT, TC} is the revised LZ-word set of S. It is interesting to note that the maximum word length in the revised LZ-word set is 2, smaller than the maximum of 4 in the LZ-word set.
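One way to read the revised procedure is sketched below. It rests on an assumption: "reproduced from the set" is taken to mean that the candidate is already a member of the set of words produced so far, so each new word is a previously seen word plus one additional symbol. This reading reproduces the worked example for S=ATGGTCGGTTTC, but it is an interpretation and not a confirmed implementation of the authors' method; the name `revised_lz_words` is hypothetical.

```python
def revised_lz_words(s):
    """Decompose s into revised LZ-words under the membership assumption."""
    seen = set()      # words produced so far
    words = []
    start, n = 0, len(s)
    while start < n:
        end = start + 1
        # Extend the candidate while it is already a known word
        # and there are symbols left to add.
        while end < n and s[start:end] in seen:
            end += 1
        word = s[start:end]
        words.append(word)
        seen.add(word)
        start = end
    return words


words = revised_lz_words("ATGGTCGGTTTC")
print(words)                 # ['A', 'T', 'G', 'GT', 'C', 'GG', 'TT', 'TC']
print(max(map(len, words)))  # 2
```

Because each check is a set lookup on a short candidate, the words stay short and the decomposition is cheap compared with searching the whole synthesized prefix.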

Operation measure between different revised LZ-word sets

Given a DNA sequence, we can get a revised LZ-word set. Here, we are interested not only in using the revised LZ-word set to numerically characterize biological sequences, but also in facilitating their comparison. There is a large body of literature on word statistics, where a sequence is interpreted as a succession of symbols (Reinert et al., 2000). A k-word is a series of k consecutive letters in a sequence. Word statistical analysis consists of counting occurrences of words and calculating their numerical characteristics. The standard approach for counting k-words in a sequence of length m is to use a sliding window of length k, shifting the frame one base at a time from position 1 to m−k+1. Instead of counting the LZ-words' content, we analyzed the distribution diversity of revised LZ-words and designed an operation measure to compare biological sequences. Given two DNA sequences X and Y, we obtained their revised LZ-word sets, written here as W_X and W_Y. We then blended W_X and W_Y to compose another set, W_XY. According to the length of the revised LZ-words, W_XY is divided into several mutually exclusive subsets, each containing the words of one length. We then lined up the elements of each subset in lexicographic order and obtained an ordered set. For example, if X=ATGCGTCGGTCCACCCACGTA and Y=ATCGGTCTGTTACAGACTACG are two given DNA sequences, we get the following W_X, W_Y and ordered W_XY sets: W_X={A, C, G, T, CA, CC, CT, GT, CAC, GTA, GTC}, W_Y={A, C, G, T, AC, AG, CT, GT, ACT, GTC, GTT}, W_XY={A, A, C, C, G, G, T, T, AC, AG, CA, CC, CT, CT, GT, GT, ACT, CAC, GTA, GTC, GTC, GTT}. Now we focus on the blend degree of two biological sequences. For any pair of neighboring elements in the ordered set W_XY, there are two possible cases: if one element is from W_X and the other is from W_Y, we say there is a transition operation between them; otherwise both come from the same set (W_X or W_Y), and we say there is an extension operation between them.
Taking the above two sequences X and Y as an example, we first listed all the elements of the revised LZ-word set of X in one line and all the elements of the revised LZ-word set of Y in another line, each marked with its own symbol. We then presented all the operations between the two sets based on the ordered union set, as shown in Fig. 1.
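The construction of the ordered union set and the labelling of neighboring pairs can be sketched as follows. The two word sets are copied from the example in the text. Two details are assumptions of this sketch and not statements from the paper: when the same word occurs in both sets, the copy from X is placed before the copy from Y, and neighboring pairs are also formed across the boundary between two word lengths. The function names are hypothetical.

```python
# Revised LZ-word sets of X and Y as listed in the example above.
X_WORDS = ["A", "C", "G", "T", "CA", "CC", "CT", "GT", "CAC", "GTA", "GTC"]
Y_WORDS = ["A", "C", "G", "T", "AC", "AG", "CT", "GT", "ACT", "GTC", "GTT"]


def ordered_union(x_words, y_words):
    """Merge both sets, keeping the origin of every word.

    Words are ordered by length first and lexicographically second.
    sorted() is stable, so for equal words the X copy stays first
    (an assumption of this sketch).
    """
    tagged = [(w, "X") for w in x_words] + [(w, "Y") for w in y_words]
    return sorted(tagged, key=lambda t: (len(t[0]), t[0]))


def label_operations(union):
    """Label each neighboring pair as 'transition' or 'extension'."""
    return [
        "transition" if a[1] != b[1] else "extension"
        for a, b in zip(union, union[1:])
    ]


union = ordered_union(X_WORDS, Y_WORDS)
ops = label_operations(union)
print([w for w, _ in union])
print(ops)
```

The list of words printed first coincides with the ordered union set given in the text; the operation labels depend on the tie-breaking assumption stated above and may differ from Fig. 1 if the authors ordered equal words differently.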
Fig. 1

All the transition operations and extension operations between the revised LZ-word sets of X and Y according to the ordered union set.

It is interesting to note that the transition operations in the operation figure indicate the similarity between the revised LZ-word sets of X and Y, and the extension operations imply their diversity. That is to say, the more extension operations there are, the less similar the two sets are. According to that, we define the length of an operation, L(O). Given an operation length n, we counted the total number of operations of that length in the operation figure. Since this count varies with n, L(O) can be regarded as a discrete random variable. Given the random variable L(O) and a positive integer n, P(L(O)=n) is the probability that L(O) takes the value n. The collection of pairs (n, P(L(O)=n)), for all positive integers n, is the probability distribution of L(O). Taking all the operations in Fig. 1 as an example, the probability distribution of the operations is P(L(O)=1)=11/15, P(L(O)=2)=2/15 and P(L(O)=3)=2/15. Based on the operation distribution function, we calculated its expectation and proposed an operation measure between two sequences X and Y: the expectation E[L(O)] = Σ n·P(L(O)=n), the average length of the operations, which depends on both the extension operations and the transition operations. It is important to note that the operation measure only satisfies identity and symmetry; it does not satisfy the triangle inequality. So it is only a dissimilarity measure for sequence comparison. We are interested in the operation measure for two reasons. First, it provides an opportunity to study the components' distribution, which is, in some ways, more distinctive than the total number of components in the exhaustive history. Second, it involves the lengths of the operations, and distinguishing operations by their length strengthens the effects of the different operations.
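As a small worked example of the expectation, the sketch below evaluates the average operation length for the distribution reported for Fig. 1, namely probabilities 11/15, 2/15 and 2/15 for operation lengths 1, 2 and 3. It assumes that the operation measure is exactly the expectation of L(O), as the text describes; the function name `expected_operation_length` is hypothetical, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def expected_operation_length(distribution):
    """Expectation of a discrete distribution {length: probability}."""
    if sum(distribution.values()) != 1:
        raise ValueError("probabilities must sum to 1")
    return sum(length * p for length, p in distribution.items())


# Distribution of L(O) reported for the operations in Fig. 1.
FIG1_DISTRIBUTION = {
    1: Fraction(11, 15),
    2: Fraction(2, 15),
    3: Fraction(2, 15),
}

print(expected_operation_length(FIG1_DISTRIBUTION))  # 7/5
```

For this distribution the average operation length is (11 + 4 + 6)/15 = 21/15 = 7/5, that is, 1.4.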

Results and discussion

Comparison of component distribution in the exhaustive history between Lempel–Ziv (LZ) complexity and revised Lempel–Ziv complexity

One characteristic of the revised Lempel–Ziv complexity is that it checks whether R can be reproduced from the set of LZ-words instead of from the preceding sequence. To find the difference between the two complexities, we compared their LZ-word distributions. We first compared the components in their exhaustive histories. For example, HCoV-229E is a Human coronavirus sequence of length 27,317 with accession number NC_002645. With Lempel–Ziv complexity and revised Lempel–Ziv complexity, we got two exhaustive histories. We then deleted all the separator symbols “⁎” in the exhaustive histories and obtained two new deduced sequences, HCoV-229E_LZ and HCoV-229E_RLZ. It is difficult to observe the difference between the sequences directly, but we can calculate the k-word counts of the deduced sequences to assess their difference. Fig. 2 shows the k-word counts of the deduced sequences HCoV-229E_LZ and HCoV-229E_RLZ with k from 1 to 4. Interestingly, the k-word counts of the two deduced sequences are similar in Fig. 2. That is to say, the sequence information retained by the Lempel–Ziv (LZ) complexity and the revised Lempel–Ziv complexity is similar.
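The k-word counts used in this comparison follow the sliding-window scheme described in the Method section. A minimal sketch is given below; the function name `kword_counts` is hypothetical, and the short toy sequence stands in for the deduced sequences, which are not reproduced here.

```python
from collections import Counter


def kword_counts(seq, k):
    """Count every k-word of seq with a sliding window of length k.

    The window starts at each position from 1 to m - k + 1
    (0-based: 0 to m - k), where m is the length of seq.
    """
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


toy = "ATGGTCGGTTTC"
for k in range(1, 5):          # k from 1 to 4, as in Fig. 2
    counts = kword_counts(toy, k)
    print(k, sum(counts.values()), dict(counts))
```

For a sequence of length m the counts for a given k always sum to m − k + 1, which is a quick sanity check when the same routine is applied to two sequences that are to be compared.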
Fig. 2

The comparison of k-word counts of the deduced sequences HCoV-229E_LZ and HCoV-229E_RLZ with k from 1 to 4.

In statistics, one-way analysis of variance (abbreviated one-way ANOVA) is a technique used to compare the means of two or more samples (using the F distribution). Here, we used a one-way ANOVA to test whether the k-word counts of the deduced sequences HCoV-229E_LZ and HCoV-229E_RLZ differ from each other. The F value obtained by the one-way ANOVA test tells us whether the group means differ significantly. We reject the hypothesis of equal means if the test is significant at the 0.05 level. Since the test is not significant at the 0.05 level (the reported value is 0.91), there is no significant difference between the k-word counts of the deduced sequences HCoV-229E_LZ and HCoV-229E_RLZ. That is to say, the components in the exhaustive histories obtained by Lempel–Ziv complexity and revised Lempel–Ziv complexity are similar. We found that the total number of components in the exhaustive history obtained with the revised Lempel–Ziv (LZ) complexity algorithm is 4491, which is 1026 more than with the Lempel–Ziv (LZ) complexity algorithm. In addition to the components' distribution in the exhaustive history, we also compared the distribution of the component lengths. Taking HCoV-229E as an example, Fig. 3 compares the length distributions of the components in the exhaustive histories obtained by Lempel–Ziv complexity and revised Lempel–Ziv complexity. It is interesting to note that the component lengths differ greatly between the two exhaustive histories. The maximum length of the components obtained by Lempel–Ziv complexity is 16, while that of the components obtained by revised Lempel–Ziv complexity is only 10. The most frequent component length obtained by revised Lempel–Ziv complexity is 6, which is 2 smaller than that obtained by Lempel–Ziv complexity.
Fig. 3

Comparison of length distribution of the components in the exhaustive history obtained by Lempel–Ziv (LZ) complexity and revised Lempel–Ziv complexity.

The comparison between Lempel–Ziv (LZ) complexity and revised Lempel–Ziv complexity illustrates that both extract similar information from the primary sequences, but the component lengths in the exhaustive history obtained by the revised Lempel–Ziv complexity are clearly smaller than those obtained by Lempel–Ziv complexity. So the revised Lempel–Ziv complexity makes the components easier to handle.

Influence of set splitting methods on operation measure

Given two DNA sequences X and Y, we obtained their revised LZ-word sets with the revised Lempel–Ziv complexity. We then blended the two sets to compose another set. In order to highlight the influence of different LZ-word sizes, we divided this set into several mutually exclusive sets according to the revised LZ-word length. It is worth noting that the mutually exclusive sets rely heavily on the set splitting method. In order to evaluate the influence of the set splitting method, we used the operation measure to classify HEV genotypes with step-wise refinement of the set splitting method. HEV (Hepatitis E virus) is a non-enveloped, positive-sense, single-stranded RNA virus that belongs to the genus Hepevirus in the separate family Hepeviridae (Lu et al., 2006). The genome of HEV is approximately 7.2 kb in length and contains a short 5′ untranslated region (5′ UTR), three overlapping open reading frames (ORF1, ORF2, and ORF3) and a short 3′ UTR. We retrieved a total of 48 full-length HEV genome sequences from NCBI (http://www.ncbi.nlm.nih.gov/). The strain abbreviation, accession number, nucleotide length, country, and genotype of all HEV genomes (Lu et al., 2006) are described in Table 1. The 48 HEV genomes were distinctly clustered into four genotypes by the traditional classification (Liu et al., 2008).
Table 1

Abbreviation for the strains, accession number, nucleotide length, genotype, acronym and country for each of the 48 complete HEV genomes.

No | Strain name | Accession | Length | Genotype | Abbreviation | Country
1 | B1 (Bur-82) | M73218 | 7207 | I | AA | Burma (Rangoon)
2 | B2 (Bur-86) | D10330 | 7194 | I | AB | Burma (Rangoon)
3 | I2 [Mad-93] | X99441 | 7194 | I | AC | India (Madras)
4 | I3 | AF076239 | 7194 | I | AD | India (Hyderabad)
5 | Np1(TK15/92) | AF051830 | 7199 | I | AE | Nepal (Kathamandu)
6 | P2[Abb-2B] | AF185822 | 7143 | I | AF | Pakistan (Abbottabad)
7 | Yam-67 | AF459438 | 7206 | I | AG | India (Yamuna Nagar)
8 | C1(CHT-88) | D11092 | 7207 | I | AH | China (Xinjiang, Hetian)
9 | C2(KS2–87) | L25595 | 7221 | I | AI | China (Xinjiang, Kashi)
10 | C3(CHT-87) | L08816 | 7176 | I | AJ | China (Xinjiang, Hetian)
11 | C4(Uigh179) | D11093 | 7194 | I | AK | China (Xinjiang, Uighur)
12 | China Hebei | M94177 | 7200 | I | AR | China (Hebei)
13 | P1(Sar-55) | M80581 | 7138 | I | AM | Pakistan (Sargodha)
14 | I1(FHF) | X98292 | 7202 | I | AN | India
15 | Morocco | AY230202 | 7212 | I | AO | Morocco
16 | T3 | AY204877 | 7170 | I | AP | Chad
17 | M1 | M74506 | 7180 | II | BB | Mexico (Telixtac)
18 | HE-JA10 | AB089824 | 7262 | III | CA | Japan (Tokyo)
19 | JKN-Sap | AB074918 | 7256 | III | CB | Japan (Sapporo)
20 | JMY-HAW | AB074920 | 7240 | III | CC | Japan (Sapporo)
21 | swUS1 | AF082843 | 7207 | III | CD | USA
22 | US1 | AF060668 | 7202 | III | CE | USA (Minnesota)
23 | US2 | AF060669 | 7277 | III | CF | USA (Tennessee)
24 | JBOAR1-Hyo04 | AB189070 | 7247 | III | CG | Japan (Hyogo)
25 | JDEER-Hyo03L | AB189071 | 7230 | III | CH | Japan (Hyogo)
26 | JJT-KAN | AB091394 | 7218 | III | CI | Japan (Kanagawa)
27 | JMO-Hyo03L | AB189072 | 7180 | III | CJ | Japan (Hyogo)
28 | JRA1 | AP003430 | 7230 | III | CK | Japan (Tokyo)
29 | JSO-Hyo03L | AB189073 | 7180 | III | CR | Japan (Tokyo)
30 | JTH-Hyo03L | AB189074 | 7180 | III | CM | Japan (Tokyo)
31 | JYO-Hyo03L | AB189075 | 7180 | III | CN | Japan (Tokyo)
32 | swJ570 | AB073912 | 7257 | III | CO | Japan (Tochigi)
33 | Kyrgyz | AF455784 | 7239 | III | CP | Kyrgyzstan
34 | Arkell | AY115488 | 7255 | III | CQ | Canada (Ontario, Guelph)
35 | HE-JA1 | AB097812 | 7258 | IV | DA | Japan (Hokkaido)
36 | HE-JK4 | AB099347 | 7250 | IV | DB | Japan (Tochigi)
37 | HE-JI4 | AB080575 | 7186 | IV | DC | Japan (Tochigi)
38 | JAK-Sai | AB074915 | 7236 | IV | DD | Japan (Saitama)
39 | JKK-Sap | AB074917 | 7235 | IV | DE | Japan (Sapporo)
40 | JSM-Sap95 | AB161717 | 7202 | IV | DF | Japan (Hokkaido)
41 | JSN-Sap-FH | AB091395 | 7234 | IV | DG | Japan (Hokkaido)
42 | JSN-Sap-FH02C | AB200239 | 7251 | IV | DH | Japan (Hokkaido)
43 | JTS-Sap02 | AB161718 | 7202 | IV | DI | Japan (Hokkaido)
44 | JYW-Sap02 | AB161719 | 7202 | IV | DJ | Japan (Hokkaido)
45 | swJ13–1 | AB097811 | 7258 | IV | DK | Japan (Hokkaido)
46 | swCH25 | AY594199 | 7270 | IV | DR | China (Uighur)
47 | T1 | AJ272108 | 7232 | IV | DM | China (Beijing)
48 | CCC220 | AB108537 | 7193 | IV | DN | China (Changchun)
This experiment aims at assessing how well the operation measure performs on classification under step-wise refinement of the set splitting method. Here, the set splitting methods with step-wise refinement (SSM) are denoted SSM1 to SSM5. In relation to the clustering literature (Handl et al., 2005), Neighbor-joining (Felsenstein, 1989), a classic tree construction algorithm, can be considered a hierarchical method. The results are presented in Fig. 4.
Fig. 4

Cluster trees of 48 HEV genomes using tree construction algorithm Neighbor-joining based on the proposed operation measure with SSM1, SSM2, SSM3, SSM4, and SSM5.

To evaluate the performance of the operation measure for HEV genotype classification, we counted the number of misplaced HEV genomes against a gold standard. For the classification of HEV genotypes, we took the traditional classification as the gold standard (Lu et al., 2006). The numbers of misplaced HEV genomes for the operation measure with SSM1, SSM2, SSM3, SSM4, and SSM5 are 1, 1, 2, 1 and 0, respectively. These results suggest that the most refined scheme, SSM5, gives the best classification, although the improvement is not monotonic across the schemes.

Phylogenetic analysis of coronaviruses

Since the outbreak of the atypical pneumonia referred to as severe acute respiratory syndrome (SARS), more attention has been paid to the relationships between the SARS-CoVs and the other coronaviruses, which would be helpful for discovering drugs and developing vaccines against the virus. Generally, coronaviruses can be divided into three groups according to serotype. Groups I and II contain mammalian viruses; group II coronaviruses additionally contain a hemagglutinin esterase gene homologous to that of Influenza C virus. Group III contains only avian viruses. Based on the operation measure, we next inferred the phylogenetic relationships of coronaviruses from complete coronavirus genomes. The 24 complete coronavirus genomes used in this article were downloaded from GenBank; 12 of them are SARS-CoVs and 12 are from other groups of coronaviruses. The name, accession number, abbreviation, and genome length of the 24 genomes are listed in Table 2. Given a set of biological sequences, their phylogenetic relationship can be obtained through the following main operations: first, we construct the LZ-word sets with the revised Lempel–Ziv complexity and calculate the similarity/dissimilarity using the operation measure; second, by arranging all the similarity/dissimilarity values into a matrix, we obtain a pair-wise distance matrix; finally, we pass the pair-wise distance matrix to the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) program in the PHYLIP package (Felsenstein, 1989). Fig. 5(a) is the phylogenetic tree of the 24 coronavirus genomes obtained using the proposed operation measure with SSM5.
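The second and third operations of this pipeline (arranging pair-wise values into a symmetric matrix and handing it to a tree-building program) can be sketched as below. Several parts are assumptions of this sketch: `distance_matrix` and `to_phylip` are hypothetical helper names, the text layout follows the square distance-matrix input commonly used with PHYLIP (taxon count on the first line, names padded to 10 characters), and `toy_measure` is only a placeholder that does not implement the operation measure of this paper.

```python
from itertools import combinations


def distance_matrix(seqs, measure):
    """Return (names, matrix) with matrix[i][j] = measure(seq_i, seq_j)."""
    names = list(seqs)
    n = len(names)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = float(measure(seqs[names[i]], seqs[names[j]]))
        matrix[i][j] = matrix[j][i] = d   # symmetric measure assumed
    return names, matrix


def to_phylip(names, matrix):
    """Format a square distance matrix as plain text."""
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, matrix):
        # Names are truncated or padded to 10 characters (assumed layout).
        lines.append(f"{name[:10]:<10}" + " ".join(f"{d:.4f}" for d in row))
    return "\n".join(lines)


def toy_measure(x, y):
    # Placeholder only: NOT the operation measure described in the paper.
    return abs(len(x) - len(y))


toy = {"seqA": "ATGC", "seqB": "ATGCGT", "seqC": "A"}
names, matrix = distance_matrix(toy, toy_measure)
print(to_phylip(names, matrix))
```

Only the upper triangle is computed and mirrored, which halves the number of pair-wise comparisons; this relies on the symmetry property that the text states for the operation measure.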
Table 2

The accession number, abbreviation, name and length for each of the 24 coronavirus genomes.

No | Accession | Group | Abbreviation | Genome | Length (nt)
1 | NC_002645 | I | HCoV-229E | Human coronavirus 229E | 27,317
2 | NC_002306 | I | TGEV | Transmissible gastroenteritis virus | 28,586
3 | NC_002436 | I | PEDV | Porcine epidemic diarrhea virus | 28,033
4 | U00735 | II | BCoVM | Bovine coronavirus strain Mebus | 31,032
5 | AF391542 | II | BCoVL | Bovine coronavirus isolate BCoV–LUN | 31,028
6 | AF220295 | II | BCoVQ | Bovine coronavirus strain Quebec | 31,100
7 | NC_003045 | II | BCoV | Bovine coronavirus | 31,028
8 | AF208067 | II | MHVM | Murine hepatitis virus strain ML–10 | 31,100
9 | AF201929 | II | MHV2 | Murine hepatitis virus strain 2 | 31,028
10 | AF208066 | II | MHVP | Murine hepatitis virus strain Penn 97–1 | 31,233
11 | NC_001846 | II | MHV | Murine hepatitis virus | 31,276
12 | NC_001451 | III | IBV | Avian infectious bronchitis virus | 27,608
13 | AY278488 | IV | BJ01 | SARS coronavirus BJ01 | 29,725
14 | AY278741 | IV | Urbani | SARS coronavirus Urbani | 29,727
15 | AY278491 | IV | HKU-39849 | SARS coronavirus HKU-39849 | 29,742
16 | AY278554 | IV | CUHK-W1 | SARS coronavirus CUHK–W1 | 29,736
17 | AY282752 | IV | CUHK-Su10 | SARS coronavirus CUHK–Su10 | 29,736
18 | AY283794 | IV | SIN2500 | SARS coronavirus Sin2500 | 29,711
19 | AY283795 | IV | SIN2677 | SARS coronavirus Sin2677 | 29,705
20 | AY283796 | IV | SIN2679 | SARS coronavirus Sin2679 | 29,711
21 | AY283797 | IV | SIN2748 | SARS coronavirus Sin2748 | 29,706
22 | AY283798 | IV | SIN2774 | SARS coronavirus Sin2774 | 29,711
23 | AY291451 | IV | TW1 | SARS coronavirus TW1 | 29,729
24 | NC_004718 | IV | TOR2 | SARS coronavirus | 29,751
Fig. 5

Phylogenetic tree of 24 coronavirus genomes based on (a) the proposed operation measure and (b) multiple alignment CLUSTAL X.

Generally, an independent method can be developed to evaluate the accuracy of a phylogenetic tree, or the validity of a phylogenetic tree can be tested by comparing it with authoritative ones. Here, we adopted the former to test the validity of our phylogenetic tree. Both data sets were aligned with the multiple alignment program CLUSTAL X to construct phylogenetic trees; the tree for the coronavirus genomes is presented in Fig. 5(b). Fig. 5(a) shows that our results are quite consistent with the authoritative results (Gu et al., 2004, Zheng et al., 2005) and with the multiple alignment tree in Fig. 5(b) in the following aspects. First, all SARS-CoVs are grouped in a separate branch, which appears distinct from the other three groups of coronaviruses. Second, BCoV, BCoVL, BCoVM, BCoVQ, MHV, MHV2, MHVM, and MHVP are grouped into one branch, which is consistent with the fact that they belong to group II. Third, HCoV-229E, TGEV, and PEDV are closely related to each other, which is consistent with the fact that they belong to group I. Finally, IBV forms a distinct branch within the genus Coronavirus, because it belongs to group III. Rota et al. (2003) found that the overall level of similarity between SARS-CoVs and the other coronaviruses is low. Our tree also confirms that SARS-CoVs are not closely related to any previously isolated coronaviruses and form a new group, which indicates that the SARS-CoVs have followed an independent evolutionary path since their divergence from the other coronaviruses.

Conclusion

Sequence comparison is one of the major goals of sequence analysis, and it can serve as evidence of structural and functional conservation, as well as of evolutionary relations among the sequences. Despite the prevalence of alignment-based methods, they are computationally intensive and consequently impractical for querying large data sets. Therefore, considerable effort has been made to seek alternative methods for sequence comparison. This work presented a novel method to compare biological sequences with the revised Lempel–Ziv complexity. Instead of focusing on the total number of components in the exhaustive history, we analyzed the distribution of the components themselves. We then defined transition and extension operations among the revised LZ-word sets and represented them in the operation figure. With the length of the operations in mind, we designed an operation measure to estimate the similarity/dissimilarity of two biological sequences. To assess the effectiveness of the proposed method, two sets of evaluation experiments were performed, and its performance was further compared with alignment-based methods. The results demonstrate that the proposed method is efficient, which highlights the need for LZ-based methods to consider the whole distribution of the components in the exhaustive history. This understanding can then be used to guide the development of more powerful alignment-free methods for biological sequence comparison.
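Probability distribution of the operation length L(O):
L(O) | L(O)=1 | L(O)=2 | ... | L(O)=n
P | P(L(O)=1) | P(L(O)=2) | ... | P(L(O)=n)

Probability distribution of the operation length L(O) for the operations in Fig. 1:
L(O) | L(O)=1 | L(O)=2 | L(O)=3
P | 11/15 | 2/15 | 2/15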
References (10 of 24 listed)

1. T J Wu; Y C Hsieh; L A Li. Statistical measures of DNA sequence dissimilarity under Markov chain models of base composition. Biometrics, 2001-06.

2. Gary W Stuart; Karen Moffett; Steve Baker. Integrated gene and species phylogenies from unaligned whole genome protein sequences. Bioinformatics, 2002-01.

3. Susana Vinga; Jonas Almeida. Alignment-free sequence comparison-a review. Bioinformatics, 2003-03-01.

4. Paul A Rota; M Steven Oberste; Stephan S Monroe; W Allan Nix; Ray Campagnoli; Joseph P Icenogle; Silvia Peñaranda; Bettina Bankamp; Kaija Maher; Min-Hsin Chen; Suxiong Tong; Azaibi Tamin; Luis Lowe; Michael Frace; Joseph L DeRisi; Qi Chen; David Wang; Dean D Erdman; Teresa C T Peret; Cara Burns; Thomas G Ksiazek; Pierre E Rollin; Anthony Sanchez; Stephanie Liffick; Brian Holloway; Josef Limor; Karen McCaustland; Melissa Olsen-Rasmussen; Ron Fouchier; Stephan Günther; Albert D M E Osterhaus; Christian Drosten; Mark A Pallansch; Larry J Anderson; William J Bellini. Characterization of a novel coronavirus associated with severe acute respiratory syndrome. Science, 2003-05-01.

5. Shengli Zhang; Tianming Wang. A complexity-based method to compare RNA secondary structures and its application. J Biomol Struct Dyn, 2010-10.

6. Milan Randić; Jure Zupan; Alexandru T Balaban; Drazen Vikić-Topić; Dejan Plavsić. Graphical representation of proteins. Chem Rev, 2010-10-12.

7. Ling Lu; Chunhua Li; Curt H Hagedorn. Phylogenetic analysis of global hepatitis E virus sequences: genetic diversity, subtypes and zoonosis. Rev Med Virol, 2006 Jan-Feb.

8. Zhihua Liu; Jihong Meng; Xiao Sun. A novel feature-based method for whole genome phylogenetic analysis without alignment: application to HEV genotyping and subtyping. Biochem Biophys Res Commun, 2008-01-28.

9. Julia Handl; Joshua Knowles; Douglas B Kell. Computational cluster validation in post-genomic data analysis. Bioinformatics, 2005-05-24.

10. Wen-Xin Zheng; Ling-Ling Chen; Hong-Yu Ou; Feng Gao; Chun-Ting Zhang. Coronavirus phylogeny based on a geometric approach. Mol Phylogenet Evol, 2005-08.
