Literature DB >> 21918630

Construction of random perfect phylogeny matrix.

Mehdi Sadeghi¹, Hamid Pezeshk, Changiz Eslahchi, Sara Ahmadian, Sepideh Mah Abadi.

Abstract

PURPOSE: Interest in developing methods appropriate for mapping increasing amounts of genome-wide molecular data are increasing rapidly. There is also an increasing need for methods that are able to efficiently simulate such data. PATIENTS AND METHODS: In this article, we provide a graph-theory approach to find the necessary and sufficient conditions for the existence of a phylogeny matrix with k nonidentical haplotypes, n single nucleotide polymorphisms (SNPs), and a population size of m for which the minimum allele frequency of each SNP is between two specific numbers a and b.
RESULTS: We introduce an O(max(n(2), nm)) algorithm for the random construction of such a phylogeny matrix. The running time of any algorithm for solving this problem would be Ω (nm).
CONCLUSION: We have developed software, RAPPER, based on this algorithm, which is available at http://bioinf.cs.ipm.ir/softwares/RAPPER.

Entities: Chemical Disease Gene Species

Keywords: minimum allele frequency (MAF); perfect phylogeny; recursive algorithm; tree

Year: 2010 PMID： 21918630 PMCID： PMC3170006 DOI： 10.2147/AABC.S13397

Source DB: PubMed Journal: Adv Appl Bioinform Chem ISSN： 1178-6949

Introduction

With the widespread availability of molecular data, computational methods for gene mapping are being developed. Often, the statistical properties and the behavior of these methods need to be assessed and tested by simulation. By increasing the number of computational methods for gene mapping, there is an increasing need for tools that can simulate data for long genomic regions. One of the most popular models used to infer haplotypes from genotype data is perfect phylogeny.1–3 This model reconstructs haplotype sequences with the assumptions of infinite sites and no recombination. Given a set of genotypes, the goal is to find a set of haplotypes that fit a perfect phylogeny. The solution divides haplotypes into disjoint blocks that are all compatible with a perfect phylogeny tree. Each block can be seen as a region of genome with different evolutionary history. Simulation of genotype or haplotype data based on a coalescent model is central to estimation methods and testing new methodologies. Coalescent simulation can be used to understand the statistical properties of DNA sequences under different evolutionary scenarios and also evaluate and compare different methods for haplotype analysis. A number of simulation programs have been developed under this model and are currently being used.4–11 We suggest a haplotype simulation to produce haplotype data with predefined allele frequencies with coalescent property. By using the set of haplotypes that satisfy the coalescent property, we can simulate a long genomic region, which can be used as an approximation to the evolutionary process that produce the real data. This simulation constructs a random perfect phylogeny matrix (PPM) with k nonidentical haplotypes, n single nucleotide polymorphisms (SNPs), and a population size of m in which the minimum allele frequency (MAF) of each SNP is between two predefined numbers. A simulated data set for generating a long DNA sequence can be constructed based on assumptions about recombination rate and distribution in an evolutionary model among these perfect phylogeny blocks. The phylogenetic tree is represented by a matrix A in which a is the state of character j in sequence i, and the ith row is the character vector of sequence i. In this article, we assume that characters are binary and directed, ie, only 0→1 changes may occur on any path from the root to a leaf of the tree. For the output, the ancestral state of an allele is represented by zero. We suggest a haplotype simulation approach that produces haplotype data with prespecified allele frequencies that satisfy coalescent model, ie, it produces a phylogenetic tree in which branches model the changes through the time of evolution based on the model. By above discussions, finding a method to construct random PPM with k nonidentical haplotypes, n SNPs, and a population size of m in which the MAF of each SNP is between two specific numbers a and b will be very useful for data simulation (We consider MAF of column c in A as the number of 1’s in column c.) In this article, we take a graph-theory approach to the problem and show that there is a one-to-one correspondence between the set of perfect phylogeny matrices with certain conditions and some rooted trees. We find the necessary and sufficient conditions for the existence of such trees with respect to input parameters. We present an O(max(n, nm)) algorithm for generating a random matrix with the above conditions. We have developed a software based on this algorithm, RAPPER, which generates these matrices in a reasonable time. This article is organized as follows: we provide preliminaries and formulate the problem; in sections 3 and 4, Matrices and trees, and Extended tree following Gusfield,12 we construct a tree for every matrix and discuss its properties. In Necessary conditions, we discuss the necessary conditions. In Sufficient conditions, we find some sufficient conditions and present an algorithm to generate a random sample of the abovementioned matrices.

Preliminaries

To find some necessary and sufficient conditions for the existence of a PPM with m rows, k nonidentical rows, and n columns such that in every column, number of 1s are between a and b, we need some definitions. We consider the cases that k ≥ 2. Definition 1. Let a and b be 2 integers and assume a ≤ b. The matrix B is called a (k,a,b)-PPM if B is a PPM The number of 1’s in each column is between a and b B has k nonidentical rows Example 1. The following matrix B is a (3,2,3)-PPM. Definition 2. The matrix A is called (m,a,b)-extendable if A is a PPM A has k nonidentical rows There exists a matrix B that is a (k,a,b)-PPM, and the rows of A and B are identical. (In this case, we say that A is extendable to B.) Example 2 A is (5,2,3)-extendable to B. It is obvious that there exists a (k,a,b)-PPM matrix B if and only if there exists an (m,a,b)-extendable matrix A. Definition 3. A matrix B is called good if it can be decomposed as follows: The entries of its leftmost column are all 1’s. There exist good matrices B, B, …, B such that the rest (0 or more) of the columns of B form the block structure, as illustrated in Figure 1.

Figure 1

Block structure of a good matrix.

Definition 4. A matrix A is called canonical if it satisfies the second condition of the good matrix definition. In the following definition, we assign a root to each good matrix. Definition 5. In a good matrix B, we consider the leftmost all-one column as the root of the B.

Matrices and trees

Constructing a tree from a matrix

According to Theorem 8 of Pe’er et al,13 every PPM has an ordering of its rows and columns, which yields a canonical matrix. Theorem (Pe’er et al).13 Let B be a binary matrix. The following are equivalent: B has a phylogenetic tree. There exists an ordering of the rows and columns of B, which yields a canonical matrix. Let A be a PPM that consists of B, B, …, B good blocks and c be the corresponding root of B. Following Gusfield,12 we construct a labeled tree T() by using the following steps (see Figure 2):

Figure 2

Constructing a labeled tree TA from a perfect phylogenetic matrix, A.

Add an all-one column to A as the leftmost column. Index this column by 0. Let vertex 0 be the parent of c for all 1 ≤ i ≤ d. Construct a tree from canonical form of every B in a recursive manner. (Note that B is a good matrix and has an all-one column.) The vertex set of T() is {0, 1, 2, …, n}. Now, we label some vertices of T() as follows: If the last 1 entry in row r occurs in column j, then we label vertex j of T() with r. Gusfield12 proved the following lemma. Lemma 1. Every leaf of T() is labeled, and every vertex is labeled at most once.

Necessary and sufficient conditions for PPM

Let A be a matrix and O = {r|a = 1}. According to Estabrook et al14 and Meacham,15 A is PPM if and only if for every two columns u and v, O ∩ O = φ or O ⊂ O or O ⊂O. Lemma 2. Let r be a row in A and v be a vertex in T() with label r; then a = 1 if and only if u is located in a path from root to v. Proof. Let u be located in a path from root to v. So u is an ancestor of v. By the way that the tree was constructed from canonical form of matrix A, we have O ⊂O. So, if a = 1 then a = 1. Now, let a = 1. Since v is labeled by r, the last nonzero entry of r occurs in v. So, for every child of v such as w, we have a = 0. Therefore, a = 1 implies that u is not a child of v. As a = a = 1, O ∩ O ≠ φ. Thus, O ⊂O or O ⊂O. Since u is not a child of v, O is not a subset of O. So, we have O ⊂O′ and u is an ancestor of v. Therefore, u is located in a path from root to v.

Constructing a matrix from the tree

Definition 6. A rooted tree T is called a (k,n)-complete tree if it satisfies the following conditions: V(T) = {0,1,2, …, n}. For every 1 ≤ i ≤ k, there exists exactly 1 vertex with label i. Every leaf is labeled. Every vertex has at most 1 label. From Lemma 1 and the way we construct T(), we obtain that T() is (k,n)-complete tree where A is a PPM. Now, for every (k,n)-complete tree T, we construct a PPM A with nonidentical rows as follows: Let c be an arbitrary vertex of T and T be the subtree of T with root c. We construct A = [a] by a = 1 if and only if there exists a vertex in T with label r. Gusfeild (1991) showed that A is a PPM. Let r be a row of A, which is the label of u in T. Similar to Lemma 2, a = 1 if and only if w is located in a path from root to u. Since labels are different and there is a unique path between the root and the vertices of T, rows of A are nonidentical. It is obvious that A = A and T () = T. Therefore, there is a one-to-one correspondence between the set of all PPM with k nonidentical rows and n columns and the set of all (k,n)-complete tree.

Extended tree

Let A be an (m,a,b)-extendable matrix, and let the (k,a,b)-PPM matrix B be its extension and T() be its corresponding tree. Let w be the repeating time function defined on the labeled vertices of T() as w(v) = t if and only if v is labeled by r and row r is repeated t times in B. We call (T(),w) the (m,a,b)-extended tree of A and w(v) the repeating label of v. Lemma 3. The MAF of column c in B is the sum of the repeating label of the vertices in T. Proof. Let a = 1. Then, by Lemma 3.2, c is located in a path from root to vertex v with label r. So, v is a vertex of T. Let w(v) = t. Corresponding to t repeats of r, we have t ones in column c. Therefore, the MAF of column c is equal to the sum of repeating labels in T. Example 3. For matrices in Example 2, repeating labels and MAFs are shown in Figure 3. Let A be an (m,a,b)- extendable matrix and (T(),w) be the (m,a,b)-extended tree of A. Now, by the previous lemmas, we have the following observations.

Figure 3

Repeating labels and minimum allele frequency (MAF).

Observations: O: Let u be the ancestor of v in (T(),w); then the MAF of column u is greater than or equal to the MAF of column v. O: Let v be a leaf with label i in (T(),w). By proof of Lemma 3.1, column v in A has only 1 nonzero entry in row i. Since the MAF of column v should be at least a, w(v) ≥ a. O: Let T have l leaves and c labeled internal vertices. By using Lemma 4.1 and O, the MAF of column u in B is at least al + c. O: Let T have l leaves and c labeled internal vertices. Since the MAF of each column in B is at most b, O implies that l ≤ b/a. O: Let x, x, …, x be the children of root r of (T(),w). Then, using O for each labeled vertex x, we have w(x)≤b if and only if w(x)≤b for every labeled vertex of (T(),w). In the following theorem, we show that we can always assume that the desired matrix has at least one all-zero row. By the MAF of vertex v in T(), we mean the MAF of column v in A. Theorem 1. There is an (m,a,b)-extendable matrix A if and only if there is an (m,a,b)-extendable matrix A′ that has a zero row. Proof. It is obvious that root r of T() is labeled when A has a zero row. Let A has no zero rows and r is not labeled. We consider two cases: There exists an internal node u, which is labeled by p. In this case, we consider the labeled tree T ′ by removing label p of u and giving p to r. Let only the leaves of T() be labeled. Consider vertex x such that degree x in T is at least 2, and in T, every vertex of T–x has at most 1 child (as k ≥ 2 and there is no labeled internal node and T() has at least two leaves, there exists such x). Let u and v be two leaves of T and z be the ancestor of v and the child of x. (If v is a child of x let z = v.) Since u is a leaf of T(), it has a label such as p. By removing edge xz from T(), labeling p from u, adding edge uz, and giving p to r, we obtain a new labeled tree T ′ (Figure 4). In both cases, we define repeating time function w′ on the labeled vertices of T ′ by

Figure 4

New labeled tree T ′ obtained from T.

It is obvious that T ′ is a (k,n)-complete tree, which has a zero row. Now, by using O, …, O and Lemma 3, we conclude that A′ is (m,a,b)-extendable.

Necessary conditions

In this section, we describe some necessary conditions for the existence of an (m,a,b)-tree. By applying Theorem 1, we find the conditions necessary for the existence of an (m,a,b)-tree in which the root has been labeled. First, we introduce some necessary conditions, and then in the next section, we will show that these conditions are also sufficient. Theorem 2. Assume that T is a (k,n)-complete tree and (T,w) is an (m,a,b)-tree in which the root has been labeled. Let r be the root of T, deg(r) = d, and T has l leaves. Then l ≤ k − 1. . l (a − 1) + km ≤. Proof. Since r and the leaves of T are labeled with nonidentical rows, 1 holds. It is obvious that d ≤ l. By O for each leaf v, w(v) ≥ a. Suppose x,x, …, x are the children of r, and T contains l leaves and c internal labeled vertices. Then by O, we have Therefore So, By O for each leaf v, we have w(v) ≥ a, and for each labeled vertex u, we have w(u) ≥ 1. Then, the number of rows in a (k,a,b)-PPM matrix B, which is an extension of extended A, is at least We categorize the children of r, x,x, …, x, into 3 groups: A1 = {ξ|χ = 0 ανδ λ = 1} A2 = {ξ|χ = 0 ανδ λ ≠ 1} A3 = {ξ|χ ≠ 0} Let α = |A1| and β = |A2|. Theorem 3. Let (T,w) be the same as in Theorem 2 and B is its corresponding (k,a,b)-PPM. Then l ≤ b/a d + l − n ≤ a Let a/b (a be a divisor of b) and a ≠ b; then the number of x in which T has b/a leaves is less than or equal to n − k + 1. Proof. It results from the observation O. Let n be the number of vertices of T. Then, obviously So, we have Now, we find the upper and lower bounds. Upper bound Lower bound We have |A1 ∪ A2| = d − a. Since the number of internal vertices of T which have labels is k − l − 1, Now, we have and part 2 is thus proved. Suppose a/b and a ≠ b. Let S = {x|l = b/a}. By observation O, for each x ∈ S, we have c = 0. So, S ⊂ A2, and by inequality 1, we have |S| ≤ n − k + 1.

Sufficient conditions

In the previous section, we obtained some necessary conditions for the existence of an (m,a,b)-extendable tree whose root has been labeled. In this section, we show that these conditions are sufficient too. For this purpose, let l1, l1, …, l satisfy the conditions of Theorems 2 and 3. Then, we introduce an algorithm for constructing the rooted (m,a,b)-extendable tree T with root r; x1, …, x are the children of r and T has l leaves. First, we determine the number of labeled internal nodes in each T.

Determination of c

We categorize the children of r into three groups: G1: Children with l = 1. By Theorem 3, part 2, we have |G1| ≥ d + l − n. G2: Children with l = b/a and l ≠ 1. G3: Children with l = b/a for b ≠ a. By Theorem 3, part 3, we have |G3| ≥ n − k + 1. According to G1,G2, and G3, we determine c s as follows: For each x ∈ G2 and d + l − n elements of G1, we set c = 0. Let |G2| ≤ k − l + 1. Then, for each x ∈ G2, we assign c = 1. Now, we distribute the value k − l + 1 − |G2| among the members of G2 and those of G1 for which c ≠ 0 or c is not determined in step 1. Now, suppose |G2| ≥ k − l + 1. Let F be a subset of G2 of size k − l + 1. For each x ∈ F, set c = 1. For all the remaining vertices such as x, which is not considered in the above steps, we assign c = 0. By this procedure, part 2 of the Theorem 3, (|A1| ≥ d + l − n) holds. Categorize d children of the root for i = 1 to d if l = 1 then put this child in G1 if l ≠ b/a and l ≠ 1 then put this child in G2 if l = b/a and b ≠ a then put this child in G3 determine c (related to each x) according to G if x ∈ G3 or d + l − n ∈ G1 c = 0 else if |G2| ≤ k − l + 1 if x ∈ G2 then c = 1 Distribute k − l +1 − |G2| among members of G2 and those of G1 for which c ≠ 0 else if |G2| > k − l + 1 F ← (subset of G2 of size k − l + 1) if x ∈ F then c = 1 . else c1 = 0. We first initialize n for each x as follows: We distribute n − ∑n between all T at random.

Determination of n

To show that these steps are possible, it is enough to show that ∑n ≤ n. By the proof of part 2 of Theorem 3, it is enough to show that β ≤ n − k + 1. The number of vertices for which l ≠ 1 and c = 0, β, is as follows: If |G2| < k − 1 + 1, then β = |G3|. If |G2| ≥ k − l + 1, then β = |G3| + |G2| − k + l − 1. Since |G3| + |G2| = d − |G1| and |G1| ≥ d + l − n, |G3| + |G2| ≤ n − l, in both cases, β ≤ n − k + 1. Initialize n for each x if c ≥ 1 then n = c + l else if c = 0 and l ≠ 1 then n = l + 1 else if c = 0 and l = 1 then n = 1 Distribute n − ∑n between trees related to children of the root (T)

Constructing (m,a,b)-extendable tree

In this subsection, we will construct a rooted tree T with root r. x1, …, x are its children, and T has n vertices, l leaves, and c labeled internal vertices for which cs and ns are determined as described earlier. The following algorithm constructs T, 1 ≤ i ≤ d. Let LS be the set of leaf vertices Let IS be the set of internal vertices IS ← x LS ← φ for j ← 2 to n do if sizeof (LS) = l then PS = LS else if l − sizeof (LS) = n −j + 1 then PS = IS else PS = LS ∪ IS select vertex v from PS randomly and put the new vertex, w, as v’s child LS = LS ∪ w if v ∈ LS then LS ← LS − v IS ← IS ∪ v Mark c vertices from IS.

Algorithm for the construction of T(n, l, c)

Now, we add the root r and edges rx, 1 ≤ i ≤ d. The desired tree T is constructed. To label the vertices of T, 1 ≤ i ≤ d and root r and the leaves of T, we assign {1,2, …, k} to the labeled vertices of randomly. In this algorithm, we first construct all ordered pairs (l,d) that satisfy the conditions of Theorems 2 and 3. Then, we choose some of these pairs randomly and construct all d-tuples (l1, …, l), satisfying the conditions of Theorems 2 and 3 and l = l1 + l2 + … + ld. Now, one of the d-tuples is chosen randomly. Then, we classify x according to l. Using this classification, we consider a primary class for c. Now, for the remaining vertices for which we have not assigned any c, we choose a c randomly. For calculating n, we first assign an initial value for each vertex x and then distribute n − ∑n randomly. It should be noted again that by randomness, we mean distribution according to a uniform random variable. We also define a function w on the labeled vertices of the (k,n)-complete tree, T, such that (T,w) becomes an (m,a,b)-tree (Figure 5). We obtain w from the following recursive algorithm:

Figure 5

In this figure, we have w(1) = w(2) = w(5) = 2 and w(3) = w(4) = 1.

We define w+1 recursively from w as follows: If there exists x such that the MAF of x in (T,w) is less than b, we choose an arbitrarily labeled vertex from Txi, such as u, and define w+1(u) = wi(u) + 1. We continue this procedure until we obtain the function w such that or the MAF of x = b for all 1 ≤ i ≤ d. Now, w is defined by w = w if . For the case that , we consider w by Let the vertex u of A have labels r0 and w(u) = 0. We repeat row r0 of A, t times. Let B be the matrix obtained from A by repeating the procedure. It is obvious that B has m rows, n columns, and k nonidentical rows. Let u be a column of B and a vertex of (T,w). Consider T and one of its leaf, such as u0, with label r0. We know that w(u0) ≥ a. The entry of A in column u0 and row r0 is 1. Therefore, by Lemma 2, the entry of A in column u and row r0 is also 1. Since we repeat row r0 in B at least a times, the MAF of column u in B is at least a. On the other hand, let x be the ancestor of u. Since w(u) ≤ w(x) and MAF(xi) ≤ b by observation O, MAF(u) ≤ b. Therefore, B is a (k,a,b)-PPM.

Running time of the algorithm

This algorithm has five steps. In the first step, the algorithm finds d, l, and l, …, l, which satisfy the necessary conditions in O(n). In the second step, the algorithm finds the n and c for 1 ≤ i ≤ d. In the third step, the algorithm constructs Tx, 1 ≤ i ≤ d, with n internal vertices, c labeled internal vertices, and l in O(n2). As and , the running time of the algorithm in this part is O(n2). In the next step, the algorithm defines a function on labeled vertices and finds its value recursively. As this function is called at most m times and in each call updating MAF of x’s takes O(n), the running time of the algorithm in this part is O(mn). In the last step, the algorithm constructs the desired matrix from the tree in O(mn). Thus, the total running time of the algorithm is O(max(n, nm)).

Discussion

In this article, we have presented a new model for perfect phylogeny matrices. Our goal was to construct a random perfect phylogeny matrix with k different haplotypes, n SNPs, and a population size of m for which the MAF of each SNP is between two specific numbers a and b. Our new approach allows us to find the necessary and sufficient conditions for the existence of such a matrix. As the solution matrix is a binary matrix with m rows and n columns, any algorithm for this problem is Ω(nm). We developed an O(max(n, nm)) time algorithm based on this model to solve this problem. We used the available methods to construct the perfect phylogeny matrix without taking MAF into account, and we then eliminated those columns that do not satisfy the MAF condition. It should be noted that there are two problems concerning this approach. First, we need to use an algorithm that is able to calculate the MAF of each column automatically and eliminate it if the conditions are not satisfied. Second, it is very probable that the number of columns that should be removed is very high. So, we will obtain matrices with few columns, and we have to run the algorithm several times to obtain a matrix with n columns and the required MAF. Therefore, an algorithm that could construct such matrices is of much interest. We have developed software, RAPPER, for implementing this algorithm, which is available at http://bioinf.cs.ipm.ir/softwares/RAPPER.

10 in total

1. SIMCOAL: a general coalescent program for the simulation of molecular data in interconnected populations with arbitrary demography.

Authors: L Excoffier; J Novembre; S Schneider
Journal: J Hered Date: 2000 Nov-Dec Impact factor: 2.645

Construction of random perfect phylogeny matrix.

Introduction

Preliminaries

Matrices and trees

Constructing a tree from a matrix

Necessary and sufficient conditions for PPM

Constructing a matrix from the tree

Extended tree

Necessary conditions

Sufficient conditions

Determination of c

Determination of n

Constructing (m,a,b)-extendable tree

Algorithm for the construction of T(n, l, c)

Running time of the algorithm

Discussion

1. SIMCOAL: a general coalescent program for the simulation of molecular data in interconnected populations with arbitrary demography.

2. Generating samples under a Wright-Fisher neutral model of genetic variation.

3. Simulating haplotype blocks in the human genome.

4. Haplotyping as perfect phylogeny: a direct approach.

5. SelSim: a program to simulate population genetic data with natural selection and recombination.

6. HapSim: a simulation tool for generating haplotype data with pre-specified allele frequencies and LD coefficients.

7. msHOT: modifying Hudson's ms simulator to incorporate crossover and gene conversion hotspots.

8. Efficient reconstruction of haplotype structure via perfect phylogeny.

9. CoaSim: a flexible environment for simulating genetic data under coalescent models.

10. Fast "coalescent" simulation.