LZ complexity distance of DNA sequences and its application in phylogenetic tree reconstruction.

Abstract

DNA sequences can be treated as finite-length symbol strings over a four-letter alphabet (A, C, T, G). As a universal and computable complexity measure, LZ complexity is suitable for describing the complexity of DNA sequences. In this study, a concept of conditional LZ complexity between two sequences is proposed according to the principle of the LZ complexity measure. An LZ complexity distance metric between two nonnull sequences is defined using conditional LZ complexity. Based on the LZ complexity distance, a phylogenetic tree of 26 species of placental mammals (Eutheria) with three outgroup species was reconstructed from their complete mitochondrial genomes. In the debate over which two of the three main groups of placental mammals, namely Primates, Ferungulates, and Rodents, are more closely related, the phylogenetic tree reconstructed from the LZ complexity distance supports the suggestion that Primates and Ferungulates are more closely related.

Year: 2005 PMID： 16689687 PMCID： PMC5172548 DOI： 10.1016/s1672-0229(05)03028-7

Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN： 1672-0229 Impact factor: 7.691

Introduction

Approaches to phylogenetic tree reconstruction from biological molecular data, such as DNA, RNA, and protein sequences, can be divided into two groups. The first group reconstructs phylogenetic trees by evaluating tree topologies against an optimality criterion; the two most widely used criteria are maximum parsimony and maximum likelihood. The second group relies on distance measures: a phylogenetic tree is reconstructed from a distance matrix obtained by calculating the distance between every pair of sequences. Traditional sequence distance measures include p-distance, Jukes-Cantor distance, Kimura distance, Gamma distance, and so on, all of which require sequence alignment, which in turn places strict demands on the sequence data to be aligned. Generally, some pretreatment is necessary before alignment, such as extracting related structural or functional segments from the primary sequences and performing gene prediction. Furthermore, selecting or creating an alignment score matrix is largely empirical, and different choices may change the alignment results considerably. To overcome these problems, more and more researchers have turned to alignment-free methods for DNA sequence comparison and analysis.

Complexity is one of the most basic properties of a symbolic sequence. Because DNA sequences can be treated as finite-length symbol strings over a four-letter alphabet (A, C, T, G), DNA sequence complexity has attracted many researchers. Kolmogorov complexity, the first formal theoretical description of sequence complexity, was proposed by Kolmogorov from the viewpoint of algorithmic information theory. Li et al. first introduced Kolmogorov complexity to DNA sequence analysis and proposed a DNA sequence distance measure based on it. Because Kolmogorov complexity is not computable, Chen et al. used data compression gain to approximate it. However, the generality of this approximation is greatly limited, because the compression gain varies markedly with the object being compressed and with the algorithm that a given compressor uses.

In contrast, LZ complexity, another important complexity measure proposed by Lempel and Ziv, is easily computable and is also a universal description of sequence complexity. Based on the computational principle of LZ complexity, we propose a concept of conditional LZ complexity between two sequences. An LZ complexity distance metric is then defined from conditional LZ complexity. The LZ complexity distance has been applied to the reconstruction of a phylogenetic tree of 26 species of placental mammals (Eutheria) with three outgroup species.

Model

Sequence LZ complexity and conditional LZ complexity

Preliminaries

Given a symbolic sequence $S = s_1 s_2 \ldots s_n$, the function $l(S) = n$ denotes the length of S. $S(i, j)$ denotes the subsequence $s_i s_{i+1} \ldots s_j$ of S that starts at position i and ends at position j; if i > j or j < 1, then S(i, j) is the null sequence (denoted by φ). The vocabulary of S, denoted by v(S), is the set of all subsequences (words) of S.

The concatenation of S and another sequence Q forms a new sequence R = SQ, where S is called a prefix of R and R is called an extension of S. Equivalently, S is a prefix of R if there exists an integer i such that S = R(1, i). When the length of S is not specified explicitly, it is convenient to identify the prefixes of S by means of a special operator π, where $S\pi^i = S(1, l(S) - i)$, i = 0, 1, … In particular, $S\pi^0 = S$, and $S\pi^i$ = φ for i ≥ l(S).

An extension R = SQ is said to be reproducible from S, denoted by S → R, if $Q \in v(R\pi)$. In the reproduction process, $Q \in v(R\pi)$ implies the existence of a positive integer p ≤ l(S) such that $q_i = r_{p+i-1}$, i = 1, 2, …, l(Q). R can therefore be generated from S by first copying the known symbol $s_p = r_p$ of S to obtain $q_1 = r_{l(S)+1}$; then $q_2 = r_{l(S)+2}$ can be obtained by copying $r_{p+1}$ (which may still be a symbol of S or, if p = l(S), the first and already known symbol of Q), and so on, until the last symbol of Q.

A nonnull sequence S is said to be producible from its prefix S(1, j), denoted by S(1, j) ⇒ S, if $S(1, j) \to S\pi$ and j < l(S). The production process S(1, j) ⇒ S differs from the reproduction process S(1, j) → S in how the recursive copying ends. The reproduction process requires that the whole extension belongs to the available vocabulary, namely $S(j + 1, l(S)) \in v(S\pi)$. The production process requires only that $S(j + 1, l(S) - 1) \in v(S\pi^2)$, so it allows a single-symbol innovation at the end of the copying process.
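The two relations defined above reduce to substring tests. The following is a minimal sketch rather than code from the paper: the function names `is_reproducible` and `is_producible` are hypothetical, sequences are assumed to be Python strings, and the prefix S(1, j) is identified by its length j.

```python
def is_reproducible(s: str, j: int) -> bool:
    """S(1, j) -> S: the extension S(j+1, l(S)) must occur in S without its last symbol."""
    n = len(s)
    if not 0 <= j < n:
        raise ValueError("j must satisfy 0 <= j < l(S)")
    # Searching s[:n-1] allows the copy to overlap the extension itself
    # (the recursive copying described in the text).
    return s[j:] in s[:n - 1]


def is_producible(s: str, j: int) -> bool:
    """S(1, j) => S: S(1, j) -> S*pi, so only the last symbol of S may be new."""
    n = len(s)
    if not 0 <= j < n:
        raise ValueError("j must satisfy 0 <= j < l(S)")
    # Drop the last symbol (the possible innovation) before testing.
    return s[j:n - 1] in s[:max(n - 2, 0)]
```

For example, with S = ACACAC and j = 2 the extension ACAC is reproducible by overlapping copy, whereas for S = ACACAG the extension ACAG is only producible, because the final G is an innovation.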

Sequence LZ complexity

Any nonnull sequence S can be built from the null sequence φ using an m-step production process:

$$\varphi = S(1, 0) \Rightarrow S(1, h_1) \Rightarrow S(1, h_2) \Rightarrow \cdots \Rightarrow S(1, h_m) = S$$

Note that 1 ≤ m ≤ l(S) and $h_m = l(S)$. Letting $h_0 = 0$, the above m-step production process results in a parsing of S as follows:

$$H(S) = S(1, h_1)\,S(h_1 + 1, h_2) \cdots S(h_{m-1} + 1, h_m)$$

where H(S) is called a production parsing of sequence S and $H_i(S) = S(h_{i-1} + 1, h_i)$ is called the ith production component of H(S). The number of production components in a production parsing is denoted by $c_H(S)$. A production component $H_i(S)$ and the corresponding production step $S(1, h_{i-1}) \Rightarrow S(1, h_i)$ are said to be maximum if $S(1, h_{i-1}) \nrightarrow S(1, h_i)$, where ↛ denotes the negation of →. A production parsing H(S) is said to be minimum if each of its production components, with the possible exception of the last one, is maximum. Using E(S) to denote the minimum production parsing, the number of production components in E(S) is denoted by $c_E(S)$.

Lempel and Ziv proved that the minimum production parsing of a given sequence is unique. They defined the complexity of a sequence as the number of production components in its minimum production parsing, which is called sequence LZ complexity. Using c(S) to denote the LZ complexity of sequence S, we have $c(S) = c_E(S)$. According to this definition, the LZ complexity of a sequence is obtained by building its minimum production parsing. Kaspar and Schuster presented a detailed algorithm and a flow chart to compute sequence LZ complexity. Three further inequalities, referred to below as Inequalities (1) to (3), have also been proved in previous studies (9, 10). For a detailed analysis of many other properties of sequence LZ complexity, see the same studies (9, 10).
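The minimum production parsing can be built greedily: each component is extended for as long as it remains reproducible from the symbols before it, and it ends with the first symbol that breaks reproducibility. The sketch below is a naive illustration under that reading, not the Kaspar and Schuster algorithm; the names `lz_parse` and `lz_complexity` are hypothetical, and the substring search makes it at least quadratic in the sequence length.

```python
from typing import List


def lz_parse(s: str) -> List[str]:
    """Return the components of the minimum production parsing E(S)."""
    components = []
    i, n = 0, len(s)  # i = h_{k-1}, the number of symbols already parsed
    while i < n:
        k = 1
        # Grow the candidate s[i:i+k] while it is reproducible, i.e. while it
        # occurs in the sequence truncated just before the candidate's last symbol.
        while i + k <= n and s[i:i + k] in s[:i + k - 1]:
            k += 1
        # Either the k-th symbol is an innovation (maximum component),
        # or the end of S was reached (last, possibly non-maximum component).
        components.append(s[i:i + k])
        i += k
    return components


def lz_complexity(s: str) -> int:
    """c(S): the number of components in the minimum production parsing."""
    return len(lz_parse(s))
```

As a small check, `lz_parse("AACGTACCATTG")` yields the seven components A, AC, G, T, ACC, AT, TG, so c(S) = 7 for that sequence.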

Sequence conditional LZ complexity

Sequence LZ complexity describes the complexity of a single sequence. To describe the complexity relationship between two sequences, we propose a notion of conditional LZ complexity based on the principle of sequence LZ complexity.

Given a sequence T, an extension R = SQ of sequence S is said to be conditionally reproducible from S, denoted by [T]S → R, if $Q \in v(TR\pi)$. To extend S into R, the reproduction process uses only the vocabulary of sequence R, namely $v(R\pi)$, whereas the conditional reproduction process, by concatenating T before S, also uses the information offered by T, namely the vocabulary $v(TR\pi)$, where $v(R\pi) \subseteq v(TR\pi)$. This is the main difference between the reproduction process and the conditional reproduction process. Given a sequence T, a nonnull sequence S is said to be conditionally producible from its prefix S(1, j), denoted by [T]S(1, j) ⇒ S, if $[T]S(1, j) \to S\pi$ and j < l(S).

Similar to the production parsing of S, given a conditional sequence T, the conditional production parsing of S using an m-step conditional production process can be built as:

$$H(S|T) = H_1(S|T)\,H_2(S|T) \cdots H_m(S|T)$$

where $H_i(S|T) = S(h_{i-1} + 1, h_i)$, with $h_0 = 0$ and $h_m = l(S)$, is called the ith conditional production component of H(S|T). The number of conditional production components in a conditional production parsing is denoted by $c_H(S|T)$. A conditional production component $H_i(S|T)$ and the corresponding conditional production step $[T]S(1, h_{i-1}) \Rightarrow S(1, h_i)$ are said to be maximum if $[T]S(1, h_{i-1}) \nrightarrow S(1, h_i)$. A conditional production parsing H(S|T) is said to be minimum if each of its conditional production components, with the possible exception of the last one, is maximum. Like the minimum production parsing, the minimum conditional production parsing is unique.

Relative to the minimum production parsing, any conditional production component is obtained from a larger vocabulary, $v(TR\pi) \supseteq v(R\pi)$, so each maximum production component is no longer than the corresponding maximum conditional production component. Using E(S|T) to denote the minimum conditional production parsing, the number of conditional production components in E(S|T) is denoted by $c_E(S|T)$.

Definition 1: The conditional LZ complexity of sequence S relative to the conditional sequence T is c(S|T), where $c(S|T) = c_E(S|T)$.

Note that the conditional LZ complexity of S relative to T equals the LZ complexity of S when T is null, namely c(S|T) = c(S) if T = φ. Given sequences S, Q, and T, the following inequalities can be deduced from the definition of the minimum conditional production parsing and Inequalities (1) and (2):

$$c(S|TQ) \le c(S|T), \qquad c(S|QT) \le c(S|T) \tag{4}$$

$$c(S|T) \le c(SQ|T), \qquad c(S|T) \le c(QS|T) \tag{5}$$

Inequality (4) states that the conditional LZ complexity of a given sequence is not increased by concatenating a sequence after or before the conditional sequence. Inequality (5) states that the conditional LZ complexity of a given sequence is not decreased by concatenating a sequence after or before the original sequence. We present a further inequality:

$$c(SQ|T) \le c(S|T) + c(Q|TS) \tag{6}$$

Proof: Let sequence R = SQ and c(SQ|T) = a. The minimum conditional production parsing of R given T is $E(R|T) = R(1, h_1)\,R(h_1 + 1, h_2) \cdots R(h_{a-1} + 1, h_a)$. Assume that the last symbol of sequence S, $s_{l(S)}$, lies in the kth maximum conditional production component of R given T, $E_k(R|T) = R(h_{k-1} + 1, h_k)$; then $h_{k-1} + 1 \le l(S) \le h_k$ and c(S|T) = k. Let sequence $L = R(l(S) + 1, h_k)$ and sequence $M = R(h_k + 1, l(R))$, so that Q = LM. The suffix $R(h_k + 1, h_{k+1}) \cdots R(h_{a-1} + 1, h_a)$ of E(R|T) is exactly the minimum conditional production parsing of M relative to the conditional sequence TSL, that is, $E(M|TSL) = R(h_k + 1, h_{k+1}) \cdots R(h_{a-1} + 1, h_a)$. Hence c(M|TSL) = a − k = c(R|T) − c(S|T). Since LM = Q, Inequality (4) gives c(M|TSL) ≤ c(M|TS), and Inequality (5) gives c(M|TS) ≤ c(LM|TS) = c(Q|TS). Since c(R|T) − c(S|T) = c(M|TSL) ≤ c(Q|TS), it follows that c(R|T) = c(SQ|T) ≤ c(S|T) + c(Q|TS).

The following inequality shows that conditional LZ complexity satisfies the triangle inequality:

$$c(Q|T) \le c(Q|S) + c(S|T) \tag{7}$$

Proof: By Inequality (4), c(Q|TS) ≤ c(Q|S). By Inequality (5), c(Q|T) ≤ c(SQ|T). Adding these two inequalities gives c(Q|TS) + c(Q|T) ≤ c(Q|S) + c(SQ|T), that is, c(Q|T) ≤ c(Q|S) + c(SQ|T) − c(Q|TS). By Inequality (6), c(SQ|T) − c(Q|TS) ≤ c(S|T). Hence c(Q|T) ≤ c(Q|S) + c(S|T).
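Conditional LZ complexity can be computed with the same greedy parsing as ordinary LZ complexity, the only change being that each candidate component of S is searched for in T concatenated with the already parsed part of S. The sketch below follows that reading of the definitions; the name `conditional_lz_complexity` is hypothetical and the implementation is deliberately naive.

```python
def conditional_lz_complexity(s: str, t: str) -> int:
    """c(S|T): number of components in the minimum conditional production parsing of S given T."""
    ts = t + s
    off = len(t)      # position in ts where S starts
    i, n = 0, len(s)  # i = number of symbols of S already parsed
    count = 0
    while i < n:
        k = 1
        # The candidate s[i:i+k] may be copied from T or from the earlier
        # part of S: search ts truncated just before the candidate's last symbol.
        while i + k <= n and s[i:i + k] in ts[:off + i + k - 1]:
            k += 1
        count += 1
        i += k
    return count
```

With an empty T the function reduces to the ordinary LZ complexity, for instance c(AACGTACCATTG | φ) = 7, while conditioning a sequence on itself gives a single component, c(S|S) = 1. The measure is not symmetric: for x = ACGTACGT and y = TTGCA, c(x|y) = 3 but c(y|x) = 2.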

Distance metric of sequence LZ complexity

A distance metric defined on a set of objects should satisfy the following four conditions: d(x, y) > 0, ∀ x ≠ y (positivity); d(x, y) = 0, ∀ x = y (identity); d(x, y) = d(y, x), ∀ x, y (symmetry); d(x, y) ≤ d(x, z) + d(z, y), ∀ x, y, z (triangle inequality). For nonnull sequences, the similarity between two sequences can be quantified by computing their conditional LZ complexity. By Inequality (7), conditional LZ complexity satisfies the triangle inequality. However, conditional LZ complexity is not symmetric, so it cannot be used directly as a sequence distance metric. Therefore, we propose a distance measure D(x, y) between nonnull sequences (Equation 8), built from the two conditional LZ complexities c(x|y) and c(y|x).

For nonnull sequences x, y, and z, by the definition of conditional LZ complexity, D(x, y) > 0 always holds if x ≠ y. The proposed distance also satisfies the identity condition up to an additive O(1) term if x = y. D(x, y) is symmetric for every pair of sequences x and y. By Inequality (7), we have c(x|y) ≤ c(x|z) + c(z|y) and c(y|x) ≤ c(y|z) + c(z|x). Hence max{c(x|y), c(y|x)} ≤ max{c(x|z), c(z|x)} + max{c(z|y), c(y|z)}, which implies that D(x, y) also satisfies the triangle inequality. Thus, the proposed distance is a valid distance metric. We call this distance metric, defined on nonnull sequences, the LZ complexity distance.
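A symmetric distance can be sketched from the two conditional complexities. The code below assumes the unnormalized form D(x, y) = max{c(x|y), c(y|x)}, which is the quantity bounded in the triangle inequality argument above; the paper's Equation (8) may include a normalizing factor that this sketch omits. The function names are hypothetical.

```python
def conditional_lz_complexity(s: str, t: str) -> int:
    """Naive c(S|T): greedy parsing of S with T prepended to the search window."""
    ts, off = t + s, len(t)
    i, n, count = 0, len(s), 0
    while i < n:
        k = 1
        while i + k <= n and s[i:i + k] in ts[:off + i + k - 1]:
            k += 1
        count += 1
        i += k
    return count


def lz_distance(x: str, y: str) -> int:
    """Assumed form of the distance: max{c(x|y), c(y|x)} (normalization omitted)."""
    if not x or not y:
        raise ValueError("the distance is defined for nonnull sequences only")
    return max(conditional_lz_complexity(x, y), conditional_lz_complexity(y, x))
```

Under this assumption the distance between a sequence and itself is 1 rather than 0, which matches the statement that the identity condition holds only up to an additive O(1) term.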

Application

The mammalian phylogenetic relationship at the molecular level remains a controversial topic in present-day molecular genetics. Studies using different types of molecular data and analysis methods have reached different conclusions in the debate about which two of the three main groups of placental mammals, namely Primates, Ferungulates, and Rodents, are more closely related. When an outgroup that is comparatively closely related to placental mammals is introduced into the phylogenetic analysis, there are three possible phylogenetic trees, as shown in Figure 1. Alignment analysis using some proteins encoded by the mitochondrial genome supports a closer evolutionary relationship between Primates and Rodents. The tree topology suggested in these studies is [Ferungulates (Primates, Rodents)] (Figure 1B). However, alignment analysis using mitochondrial DNA (mtDNA) sequences or some proteins encoded by the nuclear genome gives the tree topology [Rodents (Primates, Ferungulates)], which suggests that Primates and Ferungulates are more closely related (Figure 1A).

Fig. 1

Three possible trees among Primates, Ferungulates, and Rodents relative to the outgroup.

Motivated by the studies of Cao et al. and Reyes et al., we chose the whole mitochondrial genomes of 26 species of placental mammals as molecular data to reconstruct the phylogenetic tree of Eutherian orders. As in their studies, opossum, wallaroo, and platypus were selected as the outgroup. All 29 data files were obtained from the GenBank database; the 29 species and their accession numbers are listed in Table 1.

Table 1

The 29 Mammalian Species and Their GenBank Accession Numbers

Group	Species	Accession number
Primates	Human (Homo sapiens)	V00662
	Common chimpanzee (Pan troglodytes)	D38116
	Pygmy chimpanzee (Pan paniscus)	D38113
	Gorilla (Gorilla gorilla)	D38114
	Orangutan (Pongo pygmaeus)	D38115
	Gibbon (Hylobates lar)	X99256
	Baboon (Papio hamadryas)	Y18001

Ferungulates	White rhinoceros (Ceratotherium simum)	Y07726
	Harbor seal (Phoca vitulina)	X63726
	Gray seal (Halichoerus grypus)	X72004
	Cat (Felis catus)	U20753
	Fin whale (Balaenoptera physalus)	X61145
	Blue whale (Balaenoptera musculus)	X72204
	Cow (Bos taurus)	V00654
	Horse (Equus caballus)	X79547
	Donkey (Equus asinus)	X97337
	Great rhinoceros (Rhinoceros unicornis)	X97336
	Dog (Canis familiaris)	U96639
	Sheep (Ovis aries)	AF010406
	Pig (Sus scrofa)	AJ002189
	Hippopotamus (Hippopotamus amphibius)	AJ010957

Rodents	Rat (Rattus norvegicus)	X14848
	Mouse (Mus musculus)	V00711
	Squirrel (Sciurus vulgaris)	AJ238588
	Fat dormouse (Glis glis)	AJ001562
	Guinea pig (Cavia porcellus)	AJ222767

Outgroup	Opossum (Didelphis virginiana)	Z29573
	Wallaroo (Macropus robustus)	Y10524
	Platypus (Ornithorhynchus anatinus)	X83427

First, the 29 mtDNA sequences were extracted from the above 29 data files. Then the conditional LZ complexity between every pair of sequences was computed, and the LZ complexity distances were obtained according to Equation (8). These distances were assembled into a distance matrix. To reconstruct the phylogenetic tree, we used the neighbor-joining method in the PHYLIP software package (version 3.63) and the TreeView tool (version 1.6.6). The phylogenetic tree reconstructed with the proposed LZ complexity distance method is shown in Figure 2. It gives the topology [Rodents (Primates, Ferungulates)] for the phylogeny of the Eutherian orders, which is in accordance with the overall structure of the phylogeny presented in the studies of Cao et al. and Reyes et al. Furthermore, all branches in the tree agree with the result of Cao et al., and most of the clades conform to the result of Reyes et al., except for the position of guinea pig. Although guinea pig is a nonmurid rodent, in Figure 2 it is grouped with neither the nonmurid nor the murid rodents, and instead takes an outgroup position relative to Primates, Ferungulates, and Rodents. Such an unexpected disagreement may have deeper biological implications, since the phylogenetic position of guinea pig remains one of the most controversial topics in systematic biology (18, 19, 20).
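The matrix-building step can be sketched as follows. This is an illustration only: it assumes the unnormalized distance max{c(x|y), c(y|x)} in place of Equation (8), uses short toy sequences instead of the mitochondrial genomes, sets the diagonal to zero, and writes the square distance-matrix layout that PHYLIP distance programs read (taxon count on the first line, then one row per taxon with a 10-character name). All function names are hypothetical.

```python
from typing import List


def conditional_lz_complexity(s: str, t: str) -> int:
    """Naive c(S|T): greedy parsing of S with T prepended to the search window."""
    ts, off = t + s, len(t)
    i, n, count = 0, len(s), 0
    while i < n:
        k = 1
        while i + k <= n and s[i:i + k] in ts[:off + i + k - 1]:
            k += 1
        count += 1
        i += k
    return count


def lz_distance(x: str, y: str) -> int:
    """Assumed distance form: max{c(x|y), c(y|x)} (stand-in for Equation (8))."""
    return max(conditional_lz_complexity(x, y), conditional_lz_complexity(y, x))


def phylip_distance_matrix(names: List[str], seqs: List[str]) -> str:
    """Format all pairwise distances as a square PHYLIP-style distance matrix."""
    n = len(names)
    lines = [f"{n:5d}"]
    for i in range(n):
        label = names[i][:10].ljust(10)  # names occupy exactly 10 characters
        row = [0.0 if i == j else float(lz_distance(seqs[i], seqs[j]))
               for j in range(n)]  # the diagonal is set to zero by convention
        lines.append(label + "".join(f"{d:10.4f}" for d in row))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Toy sequences, not real mitochondrial genomes.
    print(phylip_distance_matrix(["seqA", "seqB", "seqC"],
                                 ["ACGTACGT", "TTGCA", "ACGTACGT"]), end="")
```

The resulting text can be saved to a file and passed to a neighbor-joining program; for genome-length sequences a faster complexity routine than this naive search would be preferable.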

Fig. 2

The phylogenetic tree reconstructed from the mtDNA sequences of 26 species of placental mammals using the LZ complexity distance, where opossum, wallaroo, and platypus were used as the outgroup.

In this study, we also reconstructed a phylogenetic tree using sequences of coding regions (data not shown). A total of 12 mitochondrial genes that encode 12 mitochondrial proteins were extracted from each of the 29 species’ mitochondrial genomes. Then the 12 gene sequences corresponding to one species were concatenated to form a new mtDNA sequence. We computed the LZ complexity distance between every two of these 29 concatenated sequences and then built up a distance matrix from these data. Using the distance matrix, another phylogenetic tree was reconstructed and it was completely in accordance with the tree shown in Figure 2. Phylogeny inferred through the above approach also implied that Primates and Ferungulates are more closely related.

Conclusion

The proposed sequence LZ complexity distance theoretically satisfies the four conditions of a distance metric and has been applied successfully to the phylogenetic tree reconstruction of 26 species of placental mammals. The phylogeny inferred with the LZ complexity distance agrees with the overall structure reported in some previous studies, which supports the validity of using the proposed distance to analyze the evolutionary relationships of DNA sequences quantitatively. The computation of the proposed distance is fully automatic and alignment-free. Unlike most existing methods of phylogenetic tree reconstruction, the proposed method requires neither gene identification nor prior biological knowledge such as an accurate alignment score matrix. In the debate over which two of the three main groups of placental mammals, namely Primates, Ferungulates, and Rodents, are more closely related, the phylogenetic tree reconstructed from whole mitochondrial genomes with the proposed distance supports the suggestion that Primates and Ferungulates are more closely related. In the reconstruction of the phylogenetic tree of 26 species of placental mammals, the results obtained from the complete mitochondrial genomes and from the concatenated coding regions are both biologically meaningful. The proposed method therefore works well without being restricted to coding sequences. The proposed sequence LZ complexity distance provides a new option for comparing and analyzing the noncoding sequences that abound in genomes.

References

1. Where do rodents fit? Evidence from the complete mitochondrial genome of Sciurus vulgaris.

2. A Compression Algorithm for DNA Sequences and Its Applications in Genome Comparison.

3. Alignment-free sequence comparison-a review.

4. Complete mitochondrial DNA sequence of the fat dormouse, Glis glis: further evidence of rodent paraphyly.

5. Conflict among individual mitochondrial proteins in resolving the phylogeny of eutherian orders.

6. The complete mitochondrial genome of the wallaroo (Macropus robustus) and the phylogenetic relationship among Monotremata, Marsupialia, and Eutheria. (A Janke; X Xu; U Arnason. Proc Natl Acad Sci U S A, 1997-02-18.)

7. TreeView: an application to display phylogenetic trees on personal computers.

8. Phylogenetic position of guinea pigs revisited.

9. The neighbor-joining method: a new method for reconstructing phylogenetic trees.

10. Mammalian phylogeny inferred from multiple protein data.