
A relative Lempel-Ziv complexity: Application to comparing biological sequences.

Liwei Liu1, Dongbo Li2, Fenglan Bai1.   

Abstract

One of the main tasks in biological sequence analysis is sequence comparison, for which numerous efficient methods have been developed. Traditional sequence comparison is based on sequence alignment. In this report, we propose a novel alignment-free method based on the relative Lempel-Ziv complexity to compare biological sequences. Vertebrate transferrin sequences and coronavirus spike protein sequences are used to evaluate the validity of the method. We use the method to build a phylogenetic tree for each of the two groups of sequences. The results demonstrate that our method is powerful and efficient.
Copyright © 2012 Elsevier B.V. All rights reserved.

Year:  2012        PMID: 32226089      PMCID: PMC7094452          DOI: 10.1016/j.cplett.2012.01.061

Source DB:  PubMed          Journal:  Chem Phys Lett        ISSN: 0009-2614            Impact factor:   2.328


Introduction

Multiple sequence alignments are an essential tool for protein structure and function prediction, phylogeny inference and other common tasks in sequence analysis. Traditional sequence alignment relies on an empirically chosen scoring matrix and gap penalty parameters, and different choices may change the alignment results considerably. To avoid this problem, several alignment-free methods for sequence comparison have attracted much interest in computational biology over the last twenty years. For example, graphical representations of biological sequences are one kind of alignment-free method for sequence analysis. A graphical method for visualizing DNA sequences was first proposed by Hamori and Ruskin [1]. Graphical approaches to studying biological systems can provide intuitive and useful insights, as indicated by many previous studies on a series of important biological topics [2], [3]. Liao et al. reported a binary coding method for RNA secondary structures, with coding rules based on the exclusive-OR operation [4]. In another paper, Liao et al. proposed a 4D representation of RNA secondary structures and outlined an approach for analyzing them mathematically and computing the similarities between RNA secondary structures [5]. Liao et al. also proposed a 6D representation of protein sequences consisting of 20 amino acids and, based on this representation, provided a proteome distance measure for constructing phylogenetic trees [6]. Jia et al. proposed a novel 2D representation for protein secondary structure sequences [7]. Word-based measures are another widely used alignment-free approach. In this method, each sequence is mapped into an n-dimensional vector according to its k-word frequencies, and the similarity between sequences is then defined on these n-dimensional vectors [8], [9]. The LZ algorithm is another widely used alignment-free algorithm.
A sequence distance measure based on the relative information between sequences, computed using Lempel–Ziv complexity, has been proposed [10]. Zhang and Wang used conditional LZ complexity to compare the linear characteristic sequences of RNA secondary structures [11]. Our approach is motivated by the LZ algorithm. To describe the complexity relationship between two sequences, in this report we use relative LZ complexity to analyze biological sequences. The proposed method is tested by phylogenetic analysis on two different data sets: 24 transferrin sequences from vertebrates and 26 spike protein sequences from coronaviruses. The results demonstrate that relative LZ complexity provides more phylogenetic information and improves the efficiency of sequence comparison.

Methods

The LZ algorithm was developed by Lempel and Ziv to analyze the complexity of linear sequences [12]. The Lempel–Ziv (LZ) complexity of a sequence is measured by the minimal number of steps required to synthesize it in a certain process. At each step only two operations are allowed: either generating an additional symbol, which ensures the uniqueness of each component, or copying the longest fragment from the part of the sequence already synthesized [13]. In recent years, several authors have applied the algorithm to compare sequences and construct phylogenetic trees. For instance, Otu and Sayood applied the LZ algorithm to phylogenetic analysis and successfully constructed phylogenetic trees for real and simulated DNA data sets [10]. Liu and Wang took the physicochemical properties of amino acids into account and used the LZ algorithm to construct phylogenetic trees [14]. In this study, a measure of relative LZ complexity between two sequences is proposed according to the principle of LZ complexity, and a distance between two non-null sequences is defined from it. We first give some basic definitions and properties of relative LZ complexity.

Let $S = s_1 s_2 \cdots s_n$ be a sequence, where $l(S) = n$ denotes the length of $S$, and let the subsequence $s_i s_{i+1} \cdots s_j$ of $S$ be denoted by $S(i, j)$. The set of all subsequences $S(i, j)$ is called the vocabulary $v(S)$ of $S$. Note that $S(i, j) = \varphi$ for $i > j$.

A non-null sequence $S = s_1 s_2 \cdots s_n$ is produced from the null sequence by the following algorithm [15].

(1) Begin with the null sequence $\varphi$ and add the prefix $s_1$. If $n > 1$, place a dot after $s_1$.

(2) Let a prefix $Q = s_1 s_2 \cdots s_r$, $0 < r < n$, be available, and set $R = s_{r+1}$. If $R$ cannot be reproduced from a subsequence of $S(1, r)$, join $Q$ and $R$ to get the new prefix $QR$ and place a dot after $QR$. If $R$ can be reproduced from a subsequence of $S(1, r)$, check whether $R = s_{r+1} s_{r+2}$ can be reproduced from a subsequence of $S(1, r+1)$; if so, check whether $R = s_{r+1} s_{r+2} s_{r+3}$ can be reproduced from a subsequence of $S(1, r+2)$, and so on. There are two possible cases. In the first, $R = s_{r+1} \cdots s_n$ reaches the end of $S$; the procedure then ends with the new prefix $QR = S$. In the second, $R = s_{r+1} \cdots s_k$ cannot be reproduced from any subsequence of $S(1, k-1)$; we then get the new prefix $QR$ and place a dot after it.

(3) Repeat step (2) until $S$ is produced.

This algorithm is the LZ algorithm. It partitions $S$ into subsequences arranged one after another. Denote this partition by

$$H(S) = S(1, h_1)\, S(h_1 + 1, h_2) \cdots S(h_{m-1} + 1, h_m).$$

Lempel and Ziv proved that the partition $H(S)$ is unique and defined the LZ complexity $C(S)$ of $S$ as the number of subsequences in $H(S)$, namely $C(S) = m$. For instance, for the sequence $S = AAACACCACAC$ the partition is $H(S) = A \cdot AAC \cdot ACC \cdot ACAC$, so $C(S) = 4$.

Given two sequences $Q$ and $R$, the same partitioning theory lets us partition $Q$ into consecutive subsequences; we call this the relative partition of $Q$ with respect to $R$. Denote it by

$$H(Q|R) = Q(1, h_1)\, Q(h_1 + 1, h_2) \cdots Q(h_{m'-1} + 1, h_{m'}).$$

$H(Q|R)$ satisfies the following three properties: $h_0 = 0$; [...]. The complexity $r(Q|R)$ of $Q$ with respect to $R$ is the number of subsequences in $H(Q|R)$, namely $r(Q|R) = m'$. This relative partition is also unique. To compute the relative LZ complexity of $Q$, we only need to add the sequence $R$ as a prefix of the sequence $Q$. The flow diagram of the algorithm for calculating $r$ is shown in Figure 1. The time complexity of this algorithm is $O(l(Q) \times l(R))$, whereas the time complexity of Zhang's algorithm is $O(l(Q) \times (l(Q) + l(R)))$ [11].
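The partition procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `lz_partition` and its arguments are assumptions introduced here, and the naive substring search makes no attempt to reach the time bound quoted in the text. Passing an empty prefix gives the ordinary partition H(S); passing a second sequence as the prefix gives the relative partition H(Q|R).

```python
def lz_partition(q, r=""):
    """Split q into LZ components, treating r as an already available prefix.

    With r == "" this is the ordinary partition H(q); otherwise it is the
    relative partition H(q|r), obtained by parsing q after prepending r.
    (Hypothetical helper for illustration only.)
    """
    text = r + q
    n = len(text)
    start = len(r)  # index in text where the current component begins
    components = []
    while start < n:
        length = 1
        # Grow the component while it can still be copied from the text that
        # precedes its last symbol (the copy may overlap the component itself).
        while (start + length <= n
               and text.find(text[start:start + length], 0, start + length - 1) != -1):
            length += 1
        # Either the component ends with one new symbol, or it ran to the end.
        components.append(text[start:start + length])
        start += length
    return components


print(lz_partition("AAACACCACAC"))       # ['A', 'AAC', 'ACC', 'ACAC']
print(len(lz_partition("AAACACCACAC")))  # 4
```

The number of components returned is the complexity, so `len(lz_partition(Q, R))` corresponds to r(Q|R) under these assumptions.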
Figure 1

Flow diagram for the algorithm to calculate the r.

Clearly, [...]. If $Q$ with its last character deleted is a subsequence of $R$, the result is $r(Q|R) = 1$. If none of the characters of $Q$ occurs in $R$, the result is $r(Q|R) = C(Q)$.

Genome rearrangement is an important area of computational biology. The goal is to find the shortest sequence of genome rearrangement operations that transforms one genome into another. There are several basic operations: reversal, translocation and transposition. Traditional sequence alignment methods can only operate locally (e.g. insert, delete, replace) and thus ignore aspects of global sequence information (e.g. reversal, translocation, transposition). For example, the sequence $Q = AAAAAAAAGGGGGGGG$ is converted to $R = GGGGGGGGAAAAAAAA$ by reversal. We use CLUSTAL 2.1 to compare these two sequences. The result is [...]. This result suggests that the number of insert and delete operations required is 16. On the other hand, we calculate $r(Q|R) = 2$. Clearly, $r(Q|R)$ reflects the similarity of the sequences to a great extent. To eliminate the effect of the length of $Q$ on the distance measure, we normalize it as $r(Q|R)/l(Q)$. Therefore we define the distance as follows: [...] (1)
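The reversal example and the normalization $r(Q|R)/l(Q)$ can be checked with a short sketch. The function names below are assumptions made for illustration, and the code computes only the normalized relative complexity described in the text; it does not attempt to reproduce the final distance of Eq. (1).

```python
def relative_lz_complexity(q, r):
    """Count the LZ components of q when it is parsed after the prefix r."""
    text = r + q
    pos, count = len(r), 0
    while pos < len(text):
        end = pos + 1
        # Extend while text[pos:end] already occurs before its last symbol.
        while end <= len(text) and text.find(text[pos:end], 0, end - 1) != -1:
            end += 1
        count += 1
        pos = end
    return count


def normalized_relative_complexity(q, r):
    """r(q|r) divided by the length of q, as described in the text."""
    return relative_lz_complexity(q, r) / len(q)


q = "A" * 8 + "G" * 8  # AAAAAAAAGGGGGGGG
r = "G" * 8 + "A" * 8  # GGGGGGGGAAAAAAAA
print(relative_lz_complexity(q, r))          # 2
print(normalized_relative_complexity(q, r))  # 0.125
```

Here the first component is the run of A's copied from R plus the first G, and the second is the remaining run of G's, which is why the count is 2 even though the two sequences differ at every aligned position.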

Results

In this section, we illustrate the performance of our method on both DNA sequences and protein sequences. The validity of a phylogenetic tree can be tested by comparing it with authoritative trees. All the phylogenetic trees are drawn using the MEGA program.

Experiment 1: Phylogenetic trees of DNA sequences

To test the validity of our method, we select transferrin sequences from 24 vertebrates as a dataset [16]. Vertebrate transferrins (including lactoferrin and ovotransferrin) are iron-binding proteins found in blood serum, interstitial spaces, milk, tears, and egg whites [17]. They are involved in iron storage and resistance to bacterial disease [16]. The 24 vertebrate transferrin sequences used in this report were downloaded from GenBank (the data are presented in Table 1).
Table 1

Transferrin sequences, sources, and accession numbers.

Sequence Name         Species                   Accession No.
Human TF              Homo sapiens              S95936
Rabbit TF             Oryctolagus cuniculus     X58533
Rat TF                Rattus norvegicus         D38380
Cow TF                Bos taurus                U02564
Buffalo LF            Bubalus arnee             AJ005203
Cow LF                Bos taurus                X57084
Camel LF              Camelus dromedarius       AJ131674
Pig LF                Sus scrofa                M92089
Human LF              H. sapiens                NM_002343
Mouse LF              Mus musculus              NM_008522
Possum TF             Trichosurus vulpecula     AF092510
Frog TF               Xenopus laevis            X54530
Japanese flounder TF  Paralichthys olivaceus    D88801
Atlantic salmon TF    Salmo salar               L20313
Brown trout TF        Salmo trutta              D89091
Lake trout TF         Salvelinus namaycush      D89090
Brook trout TF        Salvelinus fontinalis     D89089
Japanese char TF      Salvelinus pluvius        D89088
Chinook salmon TF     Oncorhynchus tshawytscha  AH008271
Coho salmon TF        Oncorhynchus kisutch      D89084
Sockeye salmon TF     Oncorhynchus nerka        D89085
Rainbow trout TF      Oncorhynchus mykiss       D89083
Amago salmon TF       Oncorhynchus masou        D89086
We consider the 24 transferrin sequences and calculate their pairwise distances using Eq. (1). Arranging these distances into a matrix gives a pair-wise distance matrix, which contains the distance information for the 24 transferrin sequences. Finally, this pair-wise distance matrix is used as input to the Neighbor-joining program in the PHYLIP package to obtain a phylogenetic tree. The result is shown in Figure 2.
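The matrix-building step described above can be sketched as follows. This is an illustrative outline under stated assumptions: `pairwise_distance_matrix` and `to_phylip` are hypothetical helper names, the distance function is supplied by the caller and is assumed to be symmetric, and the placeholder distance in the usage lines is a toy stand-in rather than Eq. (1). The output follows the square distance-matrix layout read by PHYLIP programs, with names limited to ten characters.

```python
def pairwise_distance_matrix(sequences, distance):
    """Build a square matrix of distances between all named sequences.

    sequences: dict mapping a name to its sequence string.
    distance:  caller-supplied function of two sequences (assumed symmetric).
    """
    names = list(sequences)
    matrix = [[0.0 if a == b else distance(sequences[a], sequences[b])
               for b in names]
              for a in names]
    return names, matrix


def to_phylip(names, matrix):
    """Format the matrix as PHYLIP square distance-matrix text."""
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, matrix):
        # Names occupy ten columns; longer names are truncated here, so the
        # caller must make sure the truncated names remain distinct.
        label = name[:10].ljust(10)
        lines.append(label + " " + " ".join(f"{d:.6f}" for d in row))
    return "\n".join(lines) + "\n"


# Toy usage with a placeholder distance (NOT the distance of Eq. (1)).
toy = {"seqA": "ACGTACGT", "seqB": "ACGT", "seqC": "AC"}
names, matrix = pairwise_distance_matrix(
    toy, lambda a, b: abs(len(a) - len(b)) / max(len(a), len(b)))
print(to_phylip(names, matrix))
```

The resulting text can be saved to a file and given to a neighbor-joining program that accepts this layout.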
Figure 2

Phylogenetic tree obtained by our method using transferrin sequences.

Figure 2 presents the phylogenetic tree reconstructed by our method. From Figure 2 we can observe that the transferrin (TF) proteins and the lactoferrin (LF) proteins are well separated and grouped accurately into their respective taxonomic classes. Human TF, Rabbit TF, Cow TF and Rat TF are clustered into the same branch, within which Rat TF and Cow TF are separated from Human TF and Rabbit TF. The tree in Figure 2 is largely consistent with the trees constructed by Ford [16], the most widely accepted of the published trees. In summary, our method performs well, and our results are in close agreement with those of previous studies.

Experiment 2: Phylogenetic trees of protein sequences

Phylogenetic analysis of genome sequences and protein sequences of coronaviruses has been carried out with different methods, such as multiple sequence alignment, graphical representation, and word frequency. Here the phylogenetic tree for the 26 coronavirus spike protein sequences in Table 2 is constructed by our method and presented in Figure 3.
Table 2

Coronavirus spike protein sequences, sources, and accession numbers.

Sequence Name  Species                                         Accession No.
TGEV           Transmissible gastroenteritis virus             NP_058424.1
PEDV           Porcine epidemic diarrhea virus                 NP_598310.1
HCoV-OC43      Human coronavirus OC43                          NP_937950.1
BCoVM          Bovine coronavirus strain Mebus                 AAA66399.1
BCoVL          Bovine coronavirus isolate BCoV-LUN             AAL57308.1
BCoVQ          Bovine coronavirus strain Quebec                AAL40400.1
BCOV           Bovine coronavirus                              NP_150077.1
MHVM           Mouse hepatitis virus strain ML-10              AAF69344.1
MHVP           Mouse hepatitis virus strain Penn 97–1          AAF69334.1
MHVJHM         Murine hepatitis virus strain JHM               YP_209233.1
MHVA           Mouse hepatitis virus strain MHV-A59C12 mutant  AAB86819.1
IBVBJ          Avian infectious bronchitis virus isolate BJ    AAP92675.1
IBV            Avian infectious bronchitis virus               NP_040831.1
GD03T0013      SARS coronavirus GD03T0013                      AAS10463.1
PC4-127        SARS coronavirus PC4-127                        AAU93318.1
PC4-137        SARS coronavirus PC4-137                        AAV49720.1
Civet007       SARS coronavirus civet007                       AAU04646.1
A022           SARS coronavirus A022                           AAV91631.1
GD01           SARS coronavirus GD01                           AAP51227.1
GZ02           SARS coronavirus GZ02                           AAS00003.1
CUHK-W1        SARS coronavirus CUHK-W1                        AAP13567.1
TOR2           SARS coronavirus TOR2                           AAP41037.1
Urbani         SARS coronavirus Urbani                         AAP13441.1
Frankfurt1     SARS coronavirus Frankfurt 1                    AAP33697.1
Sino1-11       SARS coronavirus Sin01-11                       AAR23250.1
Figure 3

Phylogenetic tree obtained by our method using spike protein sequences.

On March 12, 2003, WHO issued a global alert on severe acute respiratory syndrome (SARS). Since the outbreak of the atypical pneumonia referred to as SARS, several researchers have carried out mutation analysis and phylogenetic analysis [18], [19], [20]. Moreover, mutation analysis and phylogenetic analysis will help in developing effective vaccines. Based on the relative LZ complexity, we next infer the phylogenetic relationships of coronaviruses from their spike protein sequences. The 26 spike protein sequences used in this report were downloaded from GenBank; 12 are from SARS-CoVs and 14 are from other groups of coronaviruses. The name, accession number and abbreviation of the 26 spike protein sequences are listed in Table 2. Given a set of protein sequences, their phylogenetic tree can be obtained through the following main operations. First, we calculate the relative LZ complexity for the protein sequences. Second, by arranging the resulting distances into a matrix, we obtain a pair-wise distance matrix. Finally, we pass the pair-wise distance matrix to the neighbor-joining program in the PHYLIP package and draw the resulting phylogenetic relationships with the MEGA program [21]. Figure 3 presents the unrooted phylogenetic tree of the 26 sequences. As can be seen from Figure 3, the SARS-CoVs cluster together and form a separate branch, which can be easily distinguished from the other coronaviruses. The topology of the tree obtained by our method is quite consistent with the results obtained by other authors [22], [23].

Discussion

Sequence comparison is rapidly becoming an essential tool for bioinformatics applications. It has been used to support other types of analysis, from searching a database with a query DNA sequence to phylogenetic tree construction. Despite the prevalence of alignment-based methods, they are computationally intensive and consequently impractical for querying large data sets, which forces the use of heuristics to reduce running times, as exemplified by BLAST. Alignment-free comparison methods are therefore of great value, as they reduce the technical constraints of alignment. A novel alignment-free method for sequence comparison is proposed in this work, in which the relative LZ complexity is introduced into biological sequence comparison. The time complexity of the algorithm is O(l(Q) × l(R)). Its main advantage is that it can extract repeated patterns from biological sequences, so that when two sequences are compared, the subsequences they share can be detected. In this report, we use the relative LZ complexity to describe the similarity of biological sequences. The two experiments show that the proposed approach is a powerful and useful tool for the comparison of biological sequences. Studies applying this method to whole coding DNA sequences, RNA sequences and protein sequences will appear in the future. Furthermore, the method could be applied to the prediction of protein secondary structure. A shortcoming of the method is that some information may be lost when protein primary structures are converted to protein feature sequences. However, the tests show that our method can extract phylogenetic information from proteins, and hence it can complement phylogenetic analysis.

1.  Molecular evolution of transferrin: evidence for positive selection in salmonids.

Authors:  M J Ford
Journal:  Mol Biol Evol       Date:  2001-04       Impact factor: 16.240

2.  On the complexity measures of genetic sequences.

Authors:  V D Gusev; L A Nemytikova; N A Chuzhanova
Journal:  Bioinformatics       Date:  1999-12       Impact factor: 6.937

3.  A new sequence distance measure for phylogenetic tree construction.

Authors:  Hasan H Otu; Khalid Sayood
Journal:  Bioinformatics       Date:  2003-11-01       Impact factor: 6.937

4.  A complexity-based method to compare RNA secondary structures and its application.

Authors:  Shengli Zhang; Tianming Wang
Journal:  J Biomol Struct Dyn       Date:  2010-10

5.  Protein-based phylogenetic analysis by using hydropathy profile of amino acids.

Authors:  Na Liu; Tianming Wang
Journal:  FEBS Lett       Date:  2006-09-12       Impact factor: 4.124

6.  Comparison of TOPS strings based on LZ complexity.

Authors:  Liwei Liu; Tianming Wang
Journal:  J Theor Biol       Date:  2007-11-21       Impact factor: 2.691

7.  A binary coding method of RNA secondary structure and its application.

Authors:  Bo Liao; Weiyang Chen; Xingming Sun; Wen Zhu
Journal:  J Comput Chem       Date:  2009-11-15       Impact factor: 3.376

8.  H curves, a novel method of representation of nucleotide series especially suited for long DNA sequences.

Authors:  E Hamori; J Ruskin
Journal:  J Biol Chem       Date:  1983-01-25       Impact factor: 5.157

9.  Using Gaussian model to improve biological sequence comparison.

Authors:  Qi Dai; Xiaoqing Liu; Lihua Li; Yuhua Yao; Bin Han; Lei Zhu
Journal:  J Comput Chem       Date:  2010-01-30       Impact factor: 3.376

10.  Analysis of synonymous codon usage in SARS Coronavirus and other viruses in the Nidovirales.

Authors:  Wanjun Gu; Tong Zhou; Jianmin Ma; Xiao Sun; Zuhong Lu
Journal:  Virus Res       Date:  2004-05       Impact factor: 3.303
