Narayanan Raghupathy1, Dannie Durand. 1. Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA, USA. nraghupa@cs.cmu.edu
Abstract
Identifying genomic regions that descended from a common ancestor is important for understanding the function and evolution of genomes. In distantly related genomes, clusters of homologous gene pairs are evidence of candidate homologous regions. Demonstrating the statistical significance of such "gene clusters" is an essential component of comparative genomic analyses. However, currently there are no practical statistical tests for gene clusters that model the influence of the number of homologs in each gene family on cluster significance. In this work, we demonstrate empirically that failure to incorporate gene family size in gene cluster statistics results in overestimation of significance, leading to incorrect conclusions. We further present novel analytical methods for estimating gene cluster significance that take gene family size into account. Our methods do not require complete genome data and are suitable for testing individual clusters found in local regions, such as contigs in an unfinished assembly. We consider pairs of regions drawn from the same genome (paralogous clusters), as well as regions drawn from two different genomes (orthologous clusters). Determining cluster significance under general models of gene family size is computationally intractable. By assuming that all gene families are of equal size, we obtain analytical expressions that allow fast approximation of cluster probabilities. We evaluate the accuracy of this approximation by comparing the resulting gene cluster probabilities with cluster probabilities obtained by simulating a realistic, power-law distributed model of gene family size, with parameters inferred from genomic data. Surprisingly, despite the simplicity of the underlying assumption, our method accurately approximates the true cluster probabilities. It slightly overestimates these probabilities, yielding a conservative test. We present additional simulation results indicating the best choice of parameter values for data analysis in genomes of various sizes and illustrate the utility of our methods by applying them to gene clusters recently reported in the literature. Mathematical code to compute cluster probabilities using our methods is available as supplementary material.
Identifying genomic regions that descended from a common ancestor is important for understanding the function and evolution of genomes. In distantly related genomes, clusters of homologous gene pairs are evidence of candidate homologous regions. Demonstrating the statistical significance of such "gene clusters" is an essential component of comparative genomic analyses. However, currently there are no practical statistical tests for gene clusters that model the influence of the number of homologs in each gene family on cluster significance. In this work, we demonstrate empirically that failure to incorporate gene family size in gene cluster statistics results in overestimation of significance, leading to incorrect conclusions. We further present novel analytical methods for estimating gene cluster significance that take gene family size into account. Our methods do not require complete genome data and are suitable for testing individual clusters found in local regions, such as contigs in an unfinished assembly. We consider pairs of regions drawn from the same genome (paralogous clusters), as well as regions drawn from two different genomes (orthologous clusters). Determining cluster significance under general models of gene family size is computationally intractable. By assuming that all gene families are of equal size, we obtain analytical expressions that allow fast approximation of cluster probabilities. We evaluate the accuracy of this approximation by comparing the resulting gene cluster probabilities with cluster probabilities obtained by simulating a realistic, power-law distributed model of gene family size, with parameters inferred from genomic data. Surprisingly, despite the simplicity of the underlying assumption, our method accurately approximates the true cluster probabilities. It slightly overestimates these probabilities, yielding a conservative test. We present additional simulation results indicating the best choice of parameter values for data analysis in genomes of various sizes and illustrate the utility of our methods by applying them to gene clusters recently reported in the literature. Mathematical code to compute cluster probabilities using our methods is available as supplementary material.
Identifying homologous genomic regions is an important step for many comparative genomic analyses. Evidence of spatial gene conservation is used in comparative map construction, function prediction, operon detection, protein interaction, phylogeny reconstruction based on breakpoint or rearrangement distance, reconstruction of ancestral gene order, and identification of horizontal gene transfer events (Dandekar et al. 1998; Huynen and Bork 1998; Overbeek et al. 1999; Tamames 2001; Tamames et al. 2001; Zheng et al. 2002; Chen et al. 2004; Hurst et al. 2004; Bourque et al. 2005; Murphy et al. 2005; Homma et al. 2007; von Mering et al. 2007). Identifying homologous regions is straightforward when the two genomic regions are so closely related that gene content and order are preserved (fig. 1). However, the task becomes more challenging when homologous regions have been scrambled by genome rearrangement events. In more diverged genomes, “gene clusters,” regions in which gene content is similar but neither gene content nor order is completely preserved, are signatures of shared ancestry (fig. 1). Before accepting such clusters as evidence of regional homology, it is essential to rule out the possibility that the observed spatial arrangement occurred by chance.
F
Evolution of a hypothetical gene cluster: (a) Ancestral genome. (b) Genomic regions immediately after a speciation or a whole genome duplication. Gene content and order are preserved. (c) Diverged genomic regions with similar gene content but in which neither gene content nor order is preserved. Black circles represent genes that do not have homologs in the depicted regions.
Evolution of a hypothetical gene cluster: (a) Ancestral genome. (b) Genomic regions immediately after a speciation or a whole genome duplication. Gene content and order are preserved. (c) Diverged genomic regions with similar gene content but in which neither gene content nor order is preserved. Black circles represent genes that do not have homologs in the depicted regions.Tests of gene cluster significance require estimation of the probability of observing a given gene cluster under a suitable null hypothesis. This probability depends on the number of conserved genes in the cluster, the number of unmatched genes between conserved genes in the cluster (gap size), the number of genes in each genome, how the gene cluster was found, gene order conservation, the number of genomes in which the cluster was observed, and the distribution of gene family sizes in the genome.Statistical models have been developed that consider various subsets of these properties, although currently none take all into account (Wolfe and Shields 1997; Vision et al. 2000; Trachtulec and Forejt 2001; Venter et al. 2001; McLysaght et al. 2002; Vandepoele et al. 2002; Calabrese et al. 2003; Durand and Sankoff 2003; Hoberman et al. 2005; Sankoff and Haque 2005; Raghupathy et al. 2008). Typically these approaches prove significance by showing that a value of test statistic, for example, the number/density of homologous gene pairs in the cluster, as extreme as the observed value is unlikely to occur under the null hypothesis. Random gene order is the most widely used null hypothesis for gene cluster statistics.Most methods model a genome as an ordered list of genes. Physical distances and chromosome breaks are ignored. When two genomes, G1 and G2, are compared, the input is a mapping between homologous genes in G1 and G2. Genes that are homologous are considered to be in the same gene family. If each gene in G1 maps to at most one gene in G2 and vice versa, then the mapping is one-to-one and represents orthology. Otherwise, the mapping is one-to-many or many-to-many. In this case, the genes in the same family may be either orthologous or paralogous. In a genome self-comparison, the input is a single genome, G, and a mapping between homologous genes within G. In this case, all genes in the same family are paralogs.A gene cluster is a set of homologous gene pairs that satisfy a set of mathematical constraints that enforce compactness (Hoberman and Durand 2005). The most frequently used gene cluster definitions fall into one of two frameworks:The “r-window cluster”: Two genomic regions, each containing r adjacent genes, that share “at least” m homologous gene pairs.The “max-gap cluster”: Two genomic regions sharing at least m homologous gene pairs such that the number of unmatched, intervening genes between adjacent homologs in the same region is never larger than a given threshold, g.A gene cluster is orthologous if each homologous gene pair consists of one gene in G1 and one gene in G2 and paralogous if the genes are drawn from nonoverlapping regions of the same genome.Durand and Sankoff (2003) showed that in addition to the parameters that characterize the gene cluster, the significance of a gene cluster depends on how the cluster was found. Typically, one of the following strategies is used to find gene clusters:The “reference region” approach: A researcher is interested in a set of genes, called the reference genes, and searches the genome for regions containing either all the reference genes or a subset of them.The “window sampling” approach: A researcher is interested in studying the evolution of regions surrounding a pair of homologous genes of interest and searches for more homologs in their immediate vicinity. Typically, two windows, W1 and W2, containing r1 and r2 genes, respectively, are compared.The “whole genome” approach: A researcher is interested in studying spatial organization on a genomic scale and finds the set of all gene clusters by scanning both genomes.Among these, the reference region and window sampling approaches are natural models for local studies focusing on specific regions or genes (Lundin 1993; Katsanis et al. 1996; Coulier et al. 1997; Endo et al. 1997; Kasahara 1997; Ruvinsky and Silver 1997; Amores et al. 1998; Hughes 1998; Pebusque et al. 1998; Smith et al. 1999; Lipovich et al. 2001; Abi-Rached et al. 2002; Spring 2002; Danchin et al. 2003; Vienne et al. 2003). The whole genome approach is appropriate for global questions, such as the nature of duplication events that shaped the evolution of the genome (e.g., McLysaght et al. 2002; Panopoulou et al. 2003; Dehal and Boore 2005). The probability of finding a cluster increases with the number of searches required to find the cluster. In the reference region approach, the number of searches is proportional to the number of genes in the genome, n. In the window sampling approach, the search space depends on the window size r = max(r1, r2). Typically, r ≪ n. In the whole genome approach, the search space is proportional to n2 because all combinations of starting positions in both genomes must be considered. Note that a cluster found using window sampling might be significant, whereas a cluster with identical properties found using a reference region approach might not be significant.Monte Carlo methods, where the distribution of test statistic is estimated by randomization, are widely used for assessing statistical significance of gene clusters (e.g., Wolfe and Shields 1997; Vision et al. 2000; McLysaght et al. 2002; Vandepoele et al. 2002). Randomization has the advantage that it preserves genomic properties (other than gene order) including gene family sizes. However, randomization methods are computationally expensive and are suitable mainly for genome scale, rather than local, analyses because a complete comparative map is needed in order to carry out randomization.The simplest analytical methods are based on the r-window model using the reference region approach and consider only the number of homologous pairs, the number of genes in the window, and the total number of genes in the genome (Trachtulec and Forejt 2001; Venter et al. 2001). Durand and Sankoff (2003) generalized r-window gene cluster statistics to window sampling and whole genome approaches using a combinatorial framework. Hoberman et al. (2005) presented the first statistical tests for determining significance of max-gap gene clusters. They developed tests for the reference region and whole genome comparison approaches, assuming that the mapping between the genes is one-to-one. To our knowledge, there are no gene cluster statistics for the max-gap model with many-to-many mapping.Cluster significance increases when gene order is conserved. If the order of all m homologs in the cluster is preserved, then it is straightforward to incorporate gene order into an existing test (Venter et al. 2001; Durand and Sankoff 2003). However, combining partial gene order with other cluster properties in a single test is challenging (Sankoff and Haque 2005). Although recently a new gene cluster model has been proposed that is more sensitive to partial order conservation (Zhu et al. 2008).Most approaches to computing the significance of clusters spanning three or more regions combine pairwise tests in various ways (Durand and Sankoff 2003; Simillion et al. 2004). However, combining pairwise comparisons does not capture the additional significance of genes that are conserved in more than two regions resulting in underestimation of cluster significance (Raghupathy and Durand 2008). In a departure from the current approaches Raghupathy and Durand (2008) presented tests for r-window gene clusters spanning exactly three regions that combine evidence from genes shared among all three regions, as well as genes shared between pairs of regions.Realistic tests require a gene family model that captures many-to-many homology. Although this assumption may be appropriate when comparing two closely related genomes, in general establishing one-to-one homology can be difficult (Chen et al. 2007). Moreover, orthology is not always a one-to-one relationship (Fitch 2000). In many cases, such as when lineage-specific duplications have occurred, a many-to-many mapping correctly represents the underlying biology. Even when the true relationship is one-to-one homology, computational methods cannot always unambiguously identify unique homologous pairs (Chen et al. 2007).Most currently available tests assume that the mapping between homologous genes is one-to-one. The few methods that consider a many-to-many model are not suitable for practical data analysis. Durand and Sankoff (2003) presented r-window cluster statistics that consider gene family size for both the reference region and window sampling approaches. They provide upper bounds on cluster probabilities for the reference region model under the fixed-size family assumption that are computationally tractable. However, their statistical tests for the window sampling approach do not admit a computationally feasible implementation. Calabrese et al. (2003) presented an approach that combines gene cluster identification and significance testing. This method implicitly assumes a binomial gene family distribution. However, empirical studies have shown that gene family sizes follow a power-law distribution (Qian et al. 2001; Rzhetsky and Gomez 2001; Karev et al. 2002; Koonin et al. 2002; Kaplan et al. 2004). Moreover, because the test is coupled with a cluster finding algorithm, it cannot be used to test the significance of clusters found by other methods. Danchin and Pontarotti (2004) recognized the importance of accounting for multiple co-orthologs when testing the significance of conserved genomic regions and proposed a strategy for downweighting homologs in gene families, in a reference region framework. However, because the weights assigned are not based on a formal model relating gene family size to the probability of observing a cluster under the null hypothesis, their approach is not appropriate to general data analysis problems.Accurate, computationally tractable analytical methods for determining cluster significance in the presence of gene families remain a major challenge. Considering gene families is important because cluster significance is sensitive to gene family size. Given a one-to-one mapping between homologous genes in G1 and G2, each gene in G1 matches at most one gene in G2. As the gene family size increases, so does the number of possible matches and, hence, the chance occurrence of gene clusters. Failure to account for gene families can lead to overestimation of gene cluster significance.In this paper, we develop computationally tractable statistical tests for r-window clusters obtained by window sampling, assuming a many-to-many mapping between homologous genes. We focus on the case where the goal is to establish the homology of a specific pair of genomic regions. One of the key features of our model is that it does not require a complete comparative map to calculate cluster significance. Detailed information about gene content is only required within the regions of interest. Outside of the gene cluster, only global properties of genomes are needed.In developing gene cluster probabilities that depend on the distribution of gene family sizes, we show that allowing arbitrary gene family size distributions leads to an expression of cluster significance that reduces to an NP-complete problem (Garey and Johnson 1979). In order to obtain tractable analytic expressions of cluster significance, we impose the assumption that all gene families are of a fixed size, ϕ. We present easily computable expressions for determining the significance of both orthologous and paralogous gene clusters under this assumption.The fixed-size gene family assumption is highly unrealistic. Gene families clearly vary in size and a number of studies have presented evidence that gene family size follows a power-law distribution (Qian et al. 2001; Rzhetsky and Gomez 2001; Karev et al. 2002; Koonin et al. 2002; Kaplan et al. 2004). In order to test the utility of the fixed-size approximation, we used Monte Carlo simulation to estimate cluster probabilities in genomes containing power-law distributed gene families. Surprisingly, the probabilities of simulated clusters based on this more realistic gene family model are well approximated by the probabilities obtained using our analytical expressions for the fixed-size gene families. Thus, our approximations offer a satisfactory balance between speed and accuracy and are suitable for practical analyses.
Methods
In our model, the genes in a genome are partitioned into nonintersecting, fully connected families. In other words, every gene in family f is homologous to all the other genes in f and only to those genes. We define the gene family size, ϕ, as the number of genes in G belonging to family, f. In the presence of gene families, we extend the definition of an r-window gene cluster to be a pair of windows, W1 and W2, of size r1 and r2, respectively, sharing m “gene families,” where m ≤ min(r1, r2). Note that two windows share a gene family if each window has at least one member from the gene family. No additional weight is given to multiple shared pairs from the same gene family.
Orthologous Clusters for Arbitrary Gene Families
Here, we derive analytical expressions for computing the probability of observing a gene cluster under the null hypothesis of random gene order, using the number of shared gene families as our test statistic. We first consider the orthologous gene cluster scenario, where a pair of windows is sampled from two different genomes (Durand and Sankoff 2003). We then extend this model to the paralogous case, where both windows are sampled from a single genome. Let be the set of gene families represented in genomes G1 and G2 and let n = ∑ ϕ be the number of genes in G. Let be the set of all subsets of ℱ containing “exactly” k gene families. The probability that W1 and W2 share at least m gene families isThe first term p1(F) is the probability that a given set, F, of k gene families is seen in W1, and the second term, p2o(E), is the conditional probability that at least l of the families in F also appear in W2. The superscript o in q2o(m) and p2o(E) indicates that these terms refer to orthologous cluster probabilities. The superscript p will be used for paralogous clusters. Because the first term, p1(F), is the same for both orthologous and paralogous cluster probabilities, it does not require a superscript.Note that both probability terms in equation (1) depend on the probability that a window of a given size contains a certain number of gene families. Let w be the window size and λ be the number of gene families. This probability depends on the number of ways of selecting w genes from genome G, such that λ gene families are represented. Let x be the number of genes from ith gene family, f. We seek the number of possible ensembles {x1, x2, … , x}, such thatand 1 ≤ x ≤ ϕ, ∀i.In preliminary work, we derived a general solution using generating functions (Raghupathy and Durand 2005). This solution makes no restriction on the gene family size distribution. However, there is little hope of finding an efficient, exact solution, as the problem of enumerating all ensembles that satisfy equation (2) can be stated as a variant of the subset sum problem, which is NP-complete (Cormen et al. 1990). In the subset sum problem, given a finite set Sℕ and a target t ∈ ℕ, we ask whether there exists a subset S′ ⫅ S whose elements sum to t. In the problem considered here, the set S corresponds to the cardinalities of the possible contributions of the gene families to the window, and the target corresponds to the window size, w. In addition to the solution to equation (2), other aspects of computing the probability for general gene families are computationally demanding. In particular, enumeration of the set of all subsets of ℱ containing exactly k gene families is prohibitively slow.In this section, we address this problem by deriving computationally tractable, approximate methods to estimate cluster significance, assuming that the gene family size is fixed. In Results, we demonstrate, using simulation that, despite the extreme nature of the assumption, our expressions yield a good approximation of cluster probabilities under a more realistic gene family model.
Orthologous Clusters, Fixed-Size Families
The complexity of calculating qo(m) can be substantially reduced under the assumption that all gene families are of equal size, ϕ. Under this assumption, it is not necessary to enumerate ℱ, because all subsets of k gene families are indistinguishable. We can instead replace the first term, ∑ p1(F), in equation (1) with the product of two quantities. The first quantity is the number of sets of k gene families, , where . The second is the probability that exactly k gene families of size ϕ are represented in the window, p1(k). Invoking a similar transformation of the second term in equation (1), the probability that W1 and W2 share at least m gene families simplifies toUnder the fixed-size assumption, p1(k) and p2o(l) correspond to the probability that exactly k families appear in W1 and exactly l families appear in W2, respectively. Because, in both cases, the probability of observing a certain number of families in a window of a particular size is required, we first derive a general expression for the probability that a window of size w contains exactly λ gene families. Let 𝒯 be a set of λ gene families of fixed size, ϕ. Given the sample space of all sets of w genes sampled from G, we wish to determine the number of sets that contain at least one gene from each family in 𝒯. Because our cluster definition does not take into account the order of genes in a window, this enumeration is equivalent to ,𝒯), the number of ensembles satisfying equation (2), when all families are of fixed size, ϕ.To determine the number of such ensembles, we note that the contribution of the ith family in 𝒯 to the window is represented by the generating functionThe coefficient of t in α(t), denoted by [t]α(t), represents the number of ways of choosing x genes from ith family. The contributions of all λ families to the window can then be derived from the product of their generating functions, . Because the generating functions for all α(t) are identical, this product isThe coefficient [t]α(t) gives the number of ways of selecting w genes such that at least one gene from each of the λ families is represented in the sample:where the sum is over the set of all λ-tuples (x1, …, x) satisfying equation (2), under the constraint that 0 < x ≤ ϕ, ∀i. Note that the right-hand side of equation (6) is the number of ensembles containing x1 genes from the first family, x2 genes from the second family, and so forth, where corresponds to the number of ways of choosing x genes from the ith gene family. This is exactly the quantity 𝒩(ω, λ,𝒯)We can avoid enumerating the set of all λ-tuples using the following simplification: Because a binomial series is of the formthe right-hand side of equation (5) is equivalent to a binomial series of the form [(1 + t) – 1]. Applying two binomial expansions yieldsTherefore, we obtainBecause no gene family can contribute more than ϕ genes to the window, at least gene families are required to fill the window. Using the above expression for and restricting the lower bound on the dummy variable u to , we obtainWe now derive a similar simplification for p2o(l | k), the probability that W2 contains exactly l of the k gene families in W2. In this case, we seek ensembles of r2 = y + z genes, where y genes are selected from the subset of l gene families. The remaining z genes are selected from families not included in W1. We must ensure that no genes from the remaining k – l gene families appear in W2, in order to guarantee that “exactly” l families are represented in W2. The probability of this event isThe first term in the inner summation represents the number of ways of selecting y = r2 – z genes such that equation (2) is satisfied when λ = l and w = y. The second term represents the number of ways of selecting the remaining z genes. The value of z in the outer summation ranges from max {0, r2 − kϕ} to r2 − l.
Paralogous Clusters, Fixed-Size Gene Families
We can extend these results to obtain a measure of significance for paralogous gene clusters. Recall that in the paralogous case, two windows, W1 and W2, are nonoverlapping windows sampled from the same genome G. The probability that they share at least m gene families iswhere p1(k) is given in equation (10). The probability for the second window differs from equation (11), however, because the fact that both windows are sampled from the same genome further constrains the set of possible ensembles for the second window.To calculate p2p(l |k), we need an expression for the number of ensembles of y genes containing exactly l gene families. Let x denote the number of genes from the ith family that appear in W. (Unlike the orthologous case, x1 and x2 are drawn from the same genome but different windows.) Because W2 is in the same genome as W1, only ϕ′ = ϕ − x1 genes from the ith family are available to fill the second window. Thus, in the paralogous case, equation (6) becomeswhere 𝒮 is the set of all l-tuples (x12, …, x2) such thatand 0 < x2 ≤ ϕ − x1.In the case of orthologous clusters, the assumption of fixed-size gene families simplified the enumeration of ensembles for both W1 and W2. However, for paralogous clusters, the fixed-size assumption does not hold when calculating the probability for W2, as we do not know how many genes from a given gene family are represented in W1. Because the number of genes available from each family to fill W2 is no longer fixed, we cannot simplify the enumeration of 𝒮 in the same way. In order to obtain a tractable expression, we make the further assumption that the number of genes from each gene family available to fill W2 is fixed as well. We define, , where is an estimate of the mean number of genes contributed by each gene family in W1. Applying the same argument used to derive equation (9) from equation (6), we obtain the following approximation:where z ranges from max{0, r2 − kϕ} to r2 − l.The fixed-size approximation and the use of generating functions to determine the number of ensembles satisfying equation (2) result in expressions that require only simple summations (eqs. 10, 11, and 14). These expressions constitute an efficient approximation to the probability that two arbitrarily chosen windows share at least m gene families under the null hypothesis of random gene order. The results provide statistical tests for orthologous and paralogous gene clusters that depend only on the size of the conserved regions, the number of shared homologous genes, and the total number of genes in the genome. They do not require information about the spatial organization of the genome outside the regions of interest. The equations required to estimate cluster probabilities are summarized in table 1. In the following section, we compare these general fixed-size probabilities with cluster probabilities obtained using a more realistic power-law model and show that our approximations are not only tractable but also accurate.
Table 1
Expressions for Computing Cluster Significance with Constant Gene Family Size
Expressions for Computing Cluster Significance with Constant Gene Family Size
Results
Effect of Simplifying Assumptions on Cluster Significance
We first investigated the effect of the simplifying assumptions used to derive our significance tests. In order to determine the accuracy of our tests as an estimate for statistical significance, we compared the probabilities obtained under our fixed-size model (table 1) with probabilities of clusters in simulated genomes with a power-law gene family distribution.The key step in performing the simulation is to construct random genomes with gene family sizes that approximate the observed distribution in real genomes. A number of studies have shown that the gene family size distribution follows a power-law of the formwhere f(x) is the number of gene families of size x, and a and b are constants (Qian et al. 2001; Rzhetsky and Gomez 2001; Karev et al. 2002; Koonin et al. 2002; Kaplan et al. 2004). We modeled our simulated genomes on three eukaryotic genomes, yeast, fly, and human. These genomes represent a wide range of genomes sizes and include a single cell organism, an invertebrate, and a vertebrate. We focused on eukaryotes over prokaryotes because of the greater size and complexity of gene families in eukaryotes.To determine the gene family distribution, we obtained nonredundant, full length amino acid sequences from the yeast, fly, and human genomes from the Swiss-Prot database Version 50.9 (Gasteiger et al. 2003). For each genome, all-against-all Blast was performed to find a set of significantly similar sequence pairs (Altschul et al. 1997). E-values obtained from the Blast results were used to cluster the sequences into families using both the single and complete linkage clustering methods implemented in the hierarchical clustering package in R (R Development Core Team 2005). In both cases, we used an E-value threshold of 10−4, which is suitable for identifying both orthologs and paralogs. Consistent with previous studies (Qian et al. 2001; Rzhetsky and Gomez 2001; Karev et al. 2002; Koonin et al. 2002; Kaplan et al. 2004), the resulting gene family size distributions approximated a power-law. Because distributions obtained using single and complete linkage clustering were similar, we only present results obtained by single linkage here. Power-law distribution parameters were obtained by fitting the observed gene family size distributions to equation (15) and are given in table 2.
Table 2
Power-Law Parameters of Gene Family Size Distributions of the Three Genomes Species Obtained Using Single-Linkage Clustering under E Value Threshold 10−04
Genome
a
b
Yeast
2,435
2.73
Fly
803
2.76
Human
2,300
2.28
Power-Law Parameters of Gene Family Size Distributions of the Three Genomes Species Obtained Using Single-Linkage Clustering under E Value Threshold 10−04For orthologous cluster probabilities, for each genome size, an artificial genome with n genes was constructed such that the gene family size distribution follows the power-law parameters given in table 2. We used genome sizes of n = 5,000, 14,000, and 22,000, which correspond roughly to the number of genes in yeast, fly, and human genomes, respectively (http://www.ensembl.org). We next generated a pool of N = 35,000 random permutations for each of these genomes. In each simulation, two genomes were selected randomly from the pool of random permutations. A window of size r was then chosen at random from each of these two genomes, and the number of gene family matches (m) between the windows was tabulated. The probability of observing exactly m matches was estimated by sampling 25,000 window pairs. The probability of observing at least m matches was calculated from the resulting distribution.The probabilities obtained using simulation were compared with the results of our analytical approximation for r = 50, r = 100, and r = 150, typical window sizes in empirical studies (Lundin 1993; Katsanis et al. 1996; Coulier et al. 1997; Endo et al. 1997; Kasahara 1997; Ruvinsky and Silver 1997; Amores et al. 1998; Hughes 1998; Pebusque et al. 1998; Smith et al. 1999; Lipovich et al. 2001; Spring 2002). For each value of n, the orthologous gene cluster probabilities under the fixed-gene family size assumption were computed using the equations in table 1, with ϕ = 2 and ϕ = 3.The results, given in figure 2, supplementary figures S1, S2, and S3, Supplementary Material online, show that the probabilities obtained using our simplifying assumptions closely approximate the simulated cluster probabilities obtained with the power-law size distribution. When n = 5,000 and n = 14,000, the probabilities obtained with ϕ = 2 slightly overestimate the simulated cluster probabilities, for all window sizes considered. Similarly, when both genomes have 22,000 genes, the estimated probabilities obtained with ϕ = 3 slightly overestimate the power-law based probabilities. Moreover, the estimated probabilities obtained with ϕ = 2 and with ϕ = 3 give lower and upper bounds on the true probability, making it possible to estimate the magnitude of the error that can result from using this approximation. These guidelines hold for genomes of different sizes (supplementary fig. S3, Supplementary Material online). It is only necessary to use ϕ = 3 when both genomes have more than 25,000 genes. If one or both of the genomes is smaller, ϕ = 2 suffices.
F
Comparison of orthologous cluster probabilities for power-law distributed and fixed-gene family sizes. The probabilities of observing at least one cluster of size m in a window of size r = 100 when genome size (a) n = 5,000, (b) n = 14,000, and (c) n = 22,000.
Comparison of orthologous cluster probabilities for power-law distributed and fixed-gene family sizes. The probabilities of observing at least one cluster of size m in a window of size r = 100 when genome size (a) n = 5,000, (b) n = 14,000, and (c) n = 22,000.These results suggest that an accurate, conservative approximation can be obtained using the equations in table 1 with ϕ = 2 for small to medium-sized genomes and with ϕ = 3 for larger genomes. These approximations slightly underestimate the significance, guarding against false positives. Moreover, for the parameter values we considered, the set of clusters deemed significant is frequently unaffected by the use of approximation. Even though the probabilities obtained with the power-law and fixed-size models differ, for many values of m and significance thresholds α, the same clusters will be rejected by both models. For example, when n = 22,000 and r = 50, both models will reject the null hypothesis for clusters of size 4, but not size 3, at the α = 0.001 significance level. In the remaining cases, the number of matches required to make a cluster significant for a given window size is overestimated by at most one. For example, when n = 22,000, r = 100, and α = 0.001, the fixed-size approximation would rule out a cluster of size 6, although the power-law model indicates it is significant. However, the fixed-size approximation will accept clusters of size 7 and greater.We also evaluated the accuracy of the approximations for paralogous clusters given in table 1 The paralogous case required a second simplifying assumption, namely, replacing ϕ − x1 with in equation (14). As described in the supplementary text, Supplementary Material online, we confirmed by simulation that using ϕ′ = ϕ −1 has little effect on the estimated cluster probability for the parameter values investigated. The use of reflects the assumption that in a random genome, the appearance of more than one gene from a given family in a window of size r is a rare event when and . We next investigated the impact of the fixed-size approximation on paralogous cluster probabilities, by comparing the approximation in table 1 with the simulated power-law model. The simulation procedure for estimating paralogous cluster probabilities was identical to that for orthologous clusters, with the exception that in each simulation two random nonoverlapping windows were sampled from a single random genome chosen from a pool of N random genomes.Figure 3 and supplementary figures S4 and S5, Supplementary Material online, show the paralogous gene cluster significance obtained by the simulation compared with that obtained using the equations in table 1 with ϕ = 2 and ϕ = 3. The results are similar to those observed with orthologous clusters. The probabilities obtained using ϕ = 2 provide an accurate, conservative approximation when n = 5,000 and n = 14,000. For larger genomes, an accurate, conservative approximation can be obtained with ϕ = 3.
F
Comparison of paralogous cluster probabilities for power-law distributed and fixed-size gene families. The probabilities of observing at least one cluster of size m in a window of size r = 100 when genome size (a) n = 5,000, (b) n = 14,000, and (c) n = 22,000.
Comparison of paralogous cluster probabilities for power-law distributed and fixed-size gene families. The probabilities of observing at least one cluster of size m in a window of size r = 100 when genome size (a) n = 5,000, (b) n = 14,000, and (c) n = 22,000.
Importance of a Many-to-Many Homology Model
Most previously published tests for gene clustering do not model gene families, instead assuming that each gene is homologous to exactly one other gene. We investigated the importance of many-to-many homology model in estimating cluster significance by comparing probabilities obtained with the one-to-one and power-law distributed homology models. Cluster probabilities for one-to-one homology were calculated using the test based on the hypergeometric function proposed by Durand and Sankoff (2003) (eq. 22 in that paper). These were compared with the cluster probabilities obtained using simulation with power-law distributed gene family sizes, described above.Figure 4 and supplementary S6, Supplementary Material online, show that the one-to-one mapping assumption underestimates cluster probabilities and, hence, overestimates cluster significance. This problem is particularly severe for larger genome and window sizes. For example, compare the number of matches required to reject the null hypothesis at a significance level of 0.001, when n = 22,000 and r = 150 (fig. 4). Under the one-to-one assumption, an experimenter observing six homologous pairs would erroneously conclude that the cluster was significant. In contrast, at least nine homologous pairs are required in order to reject the null hypothesis under the more realistic power-law model. Therefore, unless the experimenter is able to unambiguously identify a unique homolog for each match, the use of the one-to-one homology model will lead to false positives. This effect also occurs for smaller genome and/or window sizes, but is less pronounced.
F
Comparison of orthologous cluster probabilities for power-law distributed gene families and the one-to-one homology model. The probabilities of observing at least one cluster of size m in a window of size r = 150 when genome size (a) n = 14,000 and (b) n = 22,000.
Comparison of orthologous cluster probabilities for power-law distributed gene families and the one-to-one homology model. The probabilities of observing at least one cluster of size m in a window of size r = 150 when genome size (a) n = 14,000 and (b) n = 22,000.
The Influence of Window Size on Significance
In addition to data analysis, our equations can be used to analyze trends in cluster significance, evaluate the impact of parameter choices on cluster probabilities, and to design data analysis protocols. For example, how should the window size, r, be selected in a window sampling analysis? We studied the effect of window size, r, on orthologous gene cluster significance by computing the probability of observing a gene cluster for various values of r using the equations in table 1. Because paralogous gene cluster probabilities follow similar trends, only results for orthologous clusters are given.Figure 5 shows that for given values of n, m, and ϕ, the cluster significance decreases as the size of windows increases. When the number of conserved homologous pairs in a cluster is small, the cluster is significant only for a small range of window sizes. For example, when n = 5,000, ϕ = 2, and m = 5, at a significance threshold of α = 0.0001, clusters are significant only when r ≤ 36 (fig. 5). As the number of conserved homologous pairs grows, the clusters are significant for a wider range of window sizes.
F
The effect of window size on orthologous cluster significance. The probabilities of observing at least one cluster of size m in a window of size r given a genome of size n when (a) n = 5,000, ϕ = 2, and m = 5, m = 8, and m = 10; (b) n = 22,000, ϕ = 3, and m = 5, m = 10, and m = 15.
The effect of window size on orthologous cluster significance. The probabilities of observing at least one cluster of size m in a window of size r given a genome of size n when (a) n = 5,000, ϕ = 2, and m = 5, m = 8, and m = 10; (b) n = 22,000, ϕ = 3, and m = 5, m = 10, and m = 15.When the cluster contains 10 homologous pairs, it is significant when r ≤ 79. Similar trends were found for the larger genome sizes as shown in figure 5. In addition, when the genome size is large, fewer homologous pairs are required to make a cluster significant for a given window size. For example, when n = 22,000, a cluster of five genes is significant when r ≤ 60, approximately.In the last 15 years, many reports of both paralogous and orthologous conserved gene clusters have appeared (surveyed in Abi-Rached et al. 2002; Danchin et al. 2003; Durand and Sankoff 2003). These clusters typically include 5 to 15 homologous pairs, with window sizes ranging from 15 to 300. The results in figure 5 suggest that for the larger window sizes even 15 homologous pairs may not be sufficient to reject the null hypothesis.
Application to a Real Example
Geddy and Brown (2007) used spatial genomic analysis in a recent study of the evolution and functional diversification of genes encoding pentatricopeptide repeat proteins (PPR) in plant genomes. PPR proteins are associated with various RNA processing functions, including processing of RNA transcripts, RNA editing, and initiation of translation. Some are also implicated in plant-specific functions, such as restoration of fertility. Loss of fertility is due to mitochondrially encoded cytoplasmic male sterility genes observed in a number of plant species, including radish and petunia.The spatial organization of PPR genes is highly variable compared with the relatively stable syntenic organization of other genes in the genomic regions in which they are found (Geddy and Brown 2007). In their investigation of the genomic processes driving the distribution of these genes, Geddy and Brown (2007) present partially conserved gene clusters containing PPR genes. The hypothesis that these regions are descended from the same region in an ancestral genome could have been further supported by statistical validation using tests such as those presented here.To demonstrate the relevance of our methods to current genomic studies, we applied our statistics to two of the clusters containing PPR genes identified by Geddy and Brown (2007). The first is an orthologous cluster (fig. 3 in Geddy and Brown 2007) comprised of regions from the Arabidopsis and Ogura radish genomes. The cluster spans 15 genes in the Arabidopsis genome and 6 genes in Ogura radish. Of these genes, four are homologous pairs appearing in both regions. We computed the probability of observing such an orthologous cluster using equation (3) with ϕ = 3, assuming n = 28,000 (http://www.arabidopsis.org, RadishDB: http://radish.plantbiology.msu.edu/). The resulting probability is 6.45 × 10−11, showing that the cluster is statistically significant.The second cluster is a paralogous cluster (fig. 4 in Geddy and Brown 2007) of two genomic regions containing PPR genes in the Arabidopsis genome. These regions contain 19 and 18 genes, respectively, and share 8 homologous pairs. The probability of observing such a paralogous cluster under the null hypothesis is 8.96 × 10−20, computed using equation (12) with ϕ = 3. This suggests that the cluster is highly significant. Thus, our statistical analysis provides further evidence of the shared ancestry of the gene clusters reported by Geddy and Brown (2007).This example underscores the importance of two key features of our tests: They can be applied when one-to-one homology cannot be determined and when whole genome data are not available. A many-to-many homology model is particularly important for the PPR genes because of the difficulty of determining exact homology relationships in this gene family. Evidence that these genes are under diversifying selection contradicts the usual expectation that genes that are most closely related will also be most similar. Moreover, the ability to obtain accurate sequence alignments is challenged by the presence of repeated sequence motifs within these genes. Our methods are also particularly well suited to analysis of these data because the radish genome is not completely sequenced. Thus, statistical tests based on randomization of gene order could not have been applied.
Discussion
Identification of homologous genomic regions is a fundamental component of genome evolution studies, as well as predictive methods that exploit spatial conservation for functional inference. In distantly related genomes, putative homologous regions are found through similarities in local gene content. When spatial organization has been disrupted by genome rearrangements, statistical tests are essential to exclude the possibility that such similarities arose by chance. Although there is a growing statistical methodology for validating gene clusters, practical significance tests that are applicable to noisy and incomplete data have not been realized.Here, we present accurate, efficient statistical tests that meet these needs in two ways: First, our results are appropriate for studies that focus on a single pair of regions containing specific genes of interest. Because they do not require detailed genomic information outside the region of interest, our methods can be used to analyze homologous regions in species for which a genomic map is not available, either because genome sequencing and assembly has not been completed or because the organism under study has not been targeted for genome sequencing. Such data sets are not amenable to statistical tests based on randomization.Second, our tests support a many-to-many homology model, applicable to genome self-comparison and data sets with large gene families. The challenge is to model the distribution of gene family sizes to obtain a test that is both efficient and accurate. Exact calculation of the cluster probabilities assuming an arbitrary distribution is computationally intractable. By assuming that all gene families are of equal size, we obtain an efficient test that can be easily calculated in Mathematica. To evaluate the impact of this simplifying assumption, we used simulation to estimate gene cluster probabilities under the null hypothesis using a realistic model of gene family size distributions. These were obtained by fitting a power law to clustered sequences from the yeast, fly, and human genomes. Remarkably, the results show that our tests closely approximate the null hypothesis, despite the highly unrealistic assumption on which they are based. Comparing the simulated probabilities with our analytical model shows that our tests slightly overestimate cluster probabilities, yielding a test that is accurate and conservative. We also compared previously published tests (Durand and Sankoff 2003) that assume one-to-one homology with our simulation results. The probabilities obtained by assuming one-to-one homology substantially underestimate the cluster probabilities in the simulated genomes leading to erroneous rejection of the null hypothesis. This confirms the need for statistical tests that include a model of gene families.Our results represent a practical balance between accuracy and efficiency that is well suited to analysis of real biological data sets. We demonstrate the utility of our results empirically by applying them to gene clusters in the Arabidopsis and radish genomes from a recent report on the evolution of the pentatricopeptide repeat (PPR) gene family in plants. This data set exemplifies the two practical advantages of our tests. A many-to-many homology model is required for the PPR genes because determining phylogenetic relationships is difficult in this family due to repeated motifs and diversifying selection. In addition, a method suitable for local regions is required because a complete assembly of the radish genome sequence is not yet available.
Supplementary Material
Supplementary figures S1–S6 and supplementary text are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).
Authors: J C Venter; M D Adams; E W Myers; P W Li; R J Mural; G G Sutton; H O Smith; M Yandell; C A Evans; R A Holt; J D Gocayne; P Amanatides; R M Ballew; D H Huson; J R Wortman; Q Zhang; C D Kodira; X H Zheng; L Chen; M Skupski; G Subramanian; P D Thomas; J Zhang; G L Gabor Miklos; C Nelson; S Broder; A G Clark; J Nadeau; V A McKusick; N Zinder; A J Levine; R J Roberts; M Simon; C Slayman; M Hunkapiller; R Bolanos; A Delcher; I Dew; D Fasulo; M Flanigan; L Florea; A Halpern; S Hannenhalli; S Kravitz; S Levy; C Mobarry; K Reinert; K Remington; J Abu-Threideh; E Beasley; K Biddick; V Bonazzi; R Brandon; M Cargill; I Chandramouliswaran; R Charlab; K Chaturvedi; Z Deng; V Di Francesco; P Dunn; K Eilbeck; C Evangelista; A E Gabrielian; W Gan; W Ge; F Gong; Z Gu; P Guan; T J Heiman; M E Higgins; R R Ji; Z Ke; K A Ketchum; Z Lai; Y Lei; Z Li; J Li; Y Liang; X Lin; F Lu; G V Merkulov; N Milshina; H M Moore; A K Naik; V A Narayan; B Neelam; D Nusskern; D B Rusch; S Salzberg; W Shao; B Shue; J Sun; Z Wang; A Wang; X Wang; J Wang; M Wei; R Wides; C Xiao; C Yan; A Yao; J Ye; M Zhan; W Zhang; H Zhang; Q Zhao; L Zheng; F Zhong; W Zhong; S Zhu; S Zhao; D Gilbert; S Baumhueter; G Spier; C Carter; A Cravchik; T Woodage; F Ali; H An; A Awe; D Baldwin; H Baden; M Barnstead; I Barrow; K Beeson; D Busam; A Carver; A Center; M L Cheng; L Curry; S Danaher; L Davenport; R Desilets; S Dietz; K Dodson; L Doup; S Ferriera; N Garg; A Gluecksmann; B Hart; J Haynes; C Haynes; C Heiner; S Hladun; D Hostin; J Houck; T Howland; C Ibegwam; J Johnson; F Kalush; L Kline; S Koduru; A Love; F Mann; D May; S McCawley; T McIntosh; I McMullen; M Moy; L Moy; B Murphy; K Nelson; C Pfannkoch; E Pratts; V Puri; H Qureshi; M Reardon; R Rodriguez; Y H Rogers; D Romblad; B Ruhfel; R Scott; C Sitter; M Smallwood; E Stewart; R Strong; E Suh; R Thomas; N N Tint; S Tse; C Vech; G Wang; J Wetter; S Williams; M Williams; S Windsor; E Winn-Deen; K Wolfe; J Zaveri; K Zaveri; J F Abril; R Guigó; M J Campbell; K V Sjolander; B Karlak; A Kejariwal; H Mi; B Lazareva; T Hatton; A Narechania; K Diemer; A Muruganujan; N Guo; S Sato; V Bafna; S Istrail; R Lippert; R Schwartz; B Walenz; S Yooseph; D Allen; A Basu; J Baxendale; L Blick; M Caminha; J Carnes-Stine; P Caulk; Y H Chiang; M Coyne; C Dahlke; A Deslattes Mays; M Dombroski; M Donnelly; D Ely; S Esparham; C Fosler; H Gire; S Glanowski; K Glasser; A Glodek; M Gorokhov; K Graham; B Gropman; M Harris; J Heil; S Henderson; J Hoover; D Jennings; C Jordan; J Jordan; J Kasha; L Kagan; C Kraft; A Levitsky; M Lewis; X Liu; J Lopez; D Ma; W Majoros; J McDaniel; S Murphy; M Newman; T Nguyen; N Nguyen; M Nodell; S Pan; J Peck; M Peterson; W Rowe; R Sanders; J Scott; M Simpson; T Smith; A Sprague; T Stockwell; R Turner; E Venter; M Wang; M Wen; D Wu; M Wu; A Xia; A Zandieh; X Zhu Journal: Science Date: 2001-02-16 Impact factor: 47.728
Authors: James R Doroghazi; Jessica C Albright; Anthony W Goering; Kou-San Ju; Robert R Haines; Konstantin A Tchalukov; David P Labeda; Neil L Kelleher; William W Metcalf Journal: Nat Chem Biol Date: 2014-09-28 Impact factor: 15.040
Authors: Marten Michaelis; Alexander Sobczak; Dirk Koczan; Martina Langhammer; Norbert Reinsch; Jennifer Schoen; Joachim M Weitzel Journal: BMC Genomics Date: 2017-11-21 Impact factor: 3.969