Rob Patro1, Carl Kingsford. 1. Lane Center for Computational Biology, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA. robp@cs.cmu.edu
Abstract
MOTIVATION: Reconstruction of the network-level evolutionary history of protein-protein interactions provides a principled way to relate interactions in several present-day networks. Here, we present a general framework for inferring such histories and demonstrate how it can be used to determine what interactions existed in the ancestral networks, which present-day interactions we might expect to exist based on evolutionary evidence and what information extant networks contain about the order of ancestral protein duplications. RESULTS: Our framework characterizes the space of likely parsimonious network histories. It results in a structure that can be used to find probabilities for a number of events associated with the histories. The framework is based on a directed hypergraph formulation of dynamic programming that we extend to enumerate many optimal and near-optimal solutions. The algorithm is applied to reconstructing ancestral interactions among bZIP transcription factors, imputing missing present-day interactions among the bZIPs and among proteins from five herpes viruses, and determining relative protein duplication order in the bZIP family. Our approach more accurately reconstructs ancestral interactions than existing approaches. In cross-validation tests, we find that our approach ranks the majority of the left-out present-day interactions among the top 2 and 17% of possible edges for the bZIP and herpes networks, respectively, making it a competitive approach for edge imputation. It also estimates relative bZIP protein duplication orders, using only interaction data and phylogenetic tree topology, which are significantly correlated with sequence-based estimates. AVAILABILITY: The algorithm is implemented in C++, is open source and is available at http://www.cs.cmu.edu/ckingsf/software/parana2. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
MOTIVATION: Reconstruction of the network-level evolutionary history of protein-protein interactions provides a principled way to relate interactions in several present-day networks. Here, we present a general framework for inferring such histories and demonstrate how it can be used to determine what interactions existed in the ancestral networks, which present-day interactions we might expect to exist based on evolutionary evidence and what information extant networks contain about the order of ancestral protein duplications. RESULTS: Our framework characterizes the space of likely parsimonious network histories. It results in a structure that can be used to find probabilities for a number of events associated with the histories. The framework is based on a directed hypergraph formulation of dynamic programming that we extend to enumerate many optimal and near-optimal solutions. The algorithm is applied to reconstructing ancestral interactions among bZIP transcription factors, imputing missing present-day interactions among the bZIPs and among proteins from five herpes viruses, and determining relative protein duplication order in the bZIP family. Our approach more accurately reconstructs ancestral interactions than existing approaches. In cross-validation tests, we find that our approach ranks the majority of the left-out present-day interactions among the top 2 and 17% of possible edges for the bZIP and herpes networks, respectively, making it a competitive approach for edge imputation. It also estimates relative bZIP protein duplication orders, using only interaction data and phylogenetic tree topology, which are significantly correlated with sequence-based estimates. AVAILABILITY: The algorithm is implemented in C++, is open source and is available at http://www.cs.cmu.edu/ckingsf/software/parana2. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Improved techniques for understanding how collections of protein interactions have evolved over time have a number of applications. For example, they can help identify stable and rewired modules (Kreimer ) and protein complexes (Pereira-Leal ). The quality of inferred networks under various parameters can help estimate the probabilities of different evolutionary events (Li ; Middendorf ; Navlakha and Kingsford, 2011) or correct for phylogenetic branch lengths (Zhu and Nakhleh, 2012). Ancestral network reconstruction has been explored to improve network alignment algorithms (Dutkowski and Tiuryn, 2007; Flannick , 2009; Singh ). The study of ancestral metabolic pathways can reveal how changes in metabolic pathways relate to changes in the environment (Borenstein and Feldman, 2009; Borenstein ; Mithani ). Zhang and Moret (2008, 2010) apply network evolution inference to improve inference of regulatory networks in present-day species.Previous algorithms for network history reconstruction include the use of graphical models (Dutkowski and Tiuryn, 2007; Pinney ), greedy local search (Navlakha and Kingsford, 2011) and extensions thereof (Li ; Zhu and Nakhleh, 2012), maximum-likelihood inference (Zhang and Moret, 2008, 2010) and other approaches (Gibson and Goldberg, 2009). Patro introduced a new parsimony framework that modeled the problem as one of finding the fewest number of interaction gain and loss events that reconstruct the observed present-day networks. Many of these previous approaches find only one possible network history and make inferences based on that single history. However, there may be a large number of optimal and near-optimal histories. A priori, we do not know how different these solutions may be, or how representative of the ensemble the solution at which we arrive is. Further, although maximum-likelihood–based approaches do not necessarily produce a single history, Carvalho and Lawrence (2008) suggest that such estimators may not generally characterize the posterior-weighted ensemble of solutions well.A maximum-likelihood network history inference method has been applied to the problem of predicting regulatory interactions in present-day networks (Zhang and Moret, 2008, 2010). However, that approach requires a known complete ordering of the duplication events in each homology group, which our approach does not. Further, being based on a parameterized network evolution model, it requires the estimation of numerous model parameters.To overcome these limitations, we present an approach, based on a novel algorithm and advanced dynamic programming techniques, which is able to efficiently characterize the relevant portion of the space of network histories without resorting to sampling. By formulating our dynamic program in the forward hypergraph framework (Gallo ), it becomes clear how to explore the space of solutions. We develop an extension of the k-best parsing algorithm of Huang and Chiang (2005) that allows us to aggregate solutions of equivalent quality. As a result, rather than enumerating individual solutions, we are able to enumerate solution classes (i.e. the set of all solutions having the same cost) and provide a characterization of the space of optimal and near-optimal solutions to an instance of the network history inference problem. Inspired by Feynman and Brown (1942), we call this method a sum-over-parsimonious-histories (SOPH) approach to ancestral network reconstruction. Although related to certain approaches in natural language processing, our approach for generating a weighted ensemble of parsimonious solutions is novel and may also prove useful in other areas of computational biology.For every potential interaction—either ancestral or extant—our algorithm computes the posterior probability, summed over an ensemble of parsimonious and near-parsimonious histories, which the interaction exists. We show this approach outperforms the graphical model formulation used in Pinney and Dutkowski and Tiuryn (2007). Further, as posterior probabilities are provided for all potential interactions (including extant ones) that participate in the ensemble, we are able to impute missing interactions and to quantify the consistency—in terms of evolutionary parsimony—of a given set of interactions.When applied to the problem of predicting ancestral interactions among bZIPs, the SOPH approach is particularly beneficial when noise is added to the present-day networks. These noisy networks simulate the common scenario in which the measurement of present-day interactions is error-prone. The SOPH method seems to be both accurate and robust. Further, anecdotally, Fossum argue that the interaction between KSHV-1 proteins UL33 and UL31 is highly conserved across many herpes species, and we find that our SOPH approach predicts an ancestral interaction between the orthology groups of these proteins with the second highest probability among all potential ancestral interactions.We test the approach’s ability to predict missing edges in present-day networks, and we show that it often outperforms a state-of-the-art approach for edge prediction based on network topology (Lei and Ruan, 2013). On the bZIP transcription factor network, where we perform leave-one-out, 5-fold and 10-fold cross-validation, we find that our approach most often puts edges from the test set in the top 1% of the probabilities assigned to pairs.We also perform edge prediction on a collection of five herpes virus protein interaction networks (Fossum ) in a similar leave-one-out setting (the data are too sparse for higher-fold cross-validation). Here, the left-out edge is, on average, in the top 25% of high-probability edges. We also breakdown performance based on which orthology groups the interaction participants are members of, and find that the good performance is driven by generally good performance for most pairs of orthology groups. As these data are believed to have high–false-negative rate, there surely are real missing edges in the given present-day networks, meaning the actual performance is likely in fact better.The ensemble of parsimonious network histories encoded by our framework can be used to answer other types of queries about the network histories that are not even possible in existing maximum-likelihood approaches that work based on interaction trees. As an illustration, we use the SOPH framework to predict the relative duplication order between pairs of ancestral bZIP proteins. Existing maximum-likelihood approaches (Dutkowski and Tiuryn, 2007; Pinney ; Zhang and Moret, 2008, 2010) cannot perform this task, as a total order of duplication events is required for the inference procedure used by those algorithms. We find that the relative duplication orders predicted by our SOPH framework, which were predicted without the use of phylogenetic branch length information, are significantly correlated with the duplication order derived from the protein sequences.
2 APPROACH
2.1 Overview
At a high level, we formulate the network history inference problem as a combinatorial optimization problem that seeks a parsimonious or low-cost set of interaction gain and loss events that explain the observed present-day networks. We rewrite the combinatorial problem by encoding it as an instance of the optimal derivation problem on a directed ordered hypergraph that allows us to efficiently count the number of solutions of various costs that are close to the optimal and to compute the probability that any particular interaction gain or loss event is present in the ensemble of near-optimal histories. The ensemble of histories that is compactly encoded by the hypergraph can also be used to answer other queries about the histories themselves, such as inferring the relative duplication order of proteins within a species.
2.2 The network history inference problem
The network history inference problem seeks to find a set of gains and losses of protein interactions that is consistent with both the observed present-day interactions and the phylogenetic history relating the proteins. Formally, we are given present-day networks for species . We are also given a set of binary phylogenetic trees where has leaves associated with a subset of . Every appears as a leaf in at most one tree, and without loss of generality, we may assume that every appears in exactly one such tree. Nodes in each tree are labeled as either protein duplication events or speciation events.An interaction event is a triple , where and . T1 may equal T2 but neither u nor v can be an ancestor of the other. If , the event represents the gain of an interaction between the ancestral proteins u and v. If , it represents the loss of an interaction.Interactions are assumed to be inherited through duplication events. An interaction exists between two proteins if it has been gained between a pair of their ancestors and not subsequently lost. Specifically, given a set of interaction events, an interaction exists between u and v if there are ancestors x of u and y of v such that the event is in , and there are no nodes such is an ancestor of u and a descendant of x, and is an ancestor of u and descendent of y such that is in .We say that a set of interaction events
The network history inference problem is thenis valid if the events are logically and temporally consistent. That is, a gain event occurs only at a time when the edge does not exist, a loss event occurs only when the edge exists and time ranges can be assigned to every node such that events only happen between pairs of nodes that have overlapping time ranges (note that we do not explicitly find these time ranges).reconstructs
if, for all , implies that edge exists if and only if that edge is present in .
Problem 1
(Network History Inference) Find the smallest set of triples on that represents a valid history of and that reconstructs the present-day networks . If a function c(e) that assigns a cost to interaction event e is given, we seek the lowest-cost set .Finding a score-weighted ensemble of solutions to this problem allows us to solve the related problems of (i) predicting ancestral interaction networks; (ii) imputing missing interactions in present-day networks; and (iii) inferring relative orders for duplication events that are consistent with a molecular clock.
3 METHODS
3.1 Directed ordered hypergraphs
We use the hypergraph definition and a number of related definitions given by Huang and Chiang (2005). Specifically, we define a directed ordered hypergraph as , where V is the set of vertices, E is the set of ordered hyperarcs, is a designated root node and is a function assigning costs to the hyperedges. Each hyperarch is a pair , where is a vertex called the head of the hyperarc and is an ordered list of vertices called the tail of the hyperarc. We denote by the ith element of the tail of . Without loss of generality, we will assume that every vertex is the head of some hyperarc ; the tail of e can be an empty list (denoted here as ). A hyperarc with head x and tail is written as .We call the set of hyperarcs with v as their head the backward star of v, and denote it by . Any vertex w that appears in the tail of some hyperarc e where is said to precede v.
3.2 The optimal derivation problem
We will formulate the network history inference problem as an instance of the optimal derivation problem in the ordered hypergraph framework (Huang and Chiang, 2005). This framework allows one to explicitly represent the space of solutions to certain classes of combinatorial problems by encoding these solutions in the topology of a directed ordered hypergraph. Such an ordered hypergraph representation is used in a wide variety of different fields, including natural language processing (Huang and Chiang, 2005; Klein and Manning, 2001) and operations research (Nielsen ). The highly similar directed hypergraph framework was first introduced in computational biology by Finkelstein and Roytberg (1993), where it was shown how many classical dynamic programming problems from sequence alignment to RNA secondary structure prediction could be formulated in this framework. Recently, Ponty and Saule (2011) applied dynamic programming in the directed hypergraph framework to the problem of pseudoknotted RNA folding, and they extended the algorithm to allow the computation of the moments of additive features (e.g. free-energy and helicies).The optimal derivation of an acyclic, directed, ordered hypergraph is , defined recursively by
By traversing the hypergraph in topological order starting with the nodes that have only zero-length tails, the solutions to subproblems are available when needed. This is the basic strategy behind traditional dynamic programming approaches, and the hypergraph representation simply makes the relation between the terms of the recurrence explicit by encoding them in the topological structure of the hypergraph. Each vertex in the hypergraph represents a term of the recurrence, and the hyperarcs encode the sub-terms (tail nodes of the arc) on which a term (head node of the arc) depends (as illustrated, e.g. in Fig. 1).
Fig. 1.
Mapping recurrence to a hypergraph. An illustration representing a particular recurrence term in the hypergraph. Each hyperarc encodes a set of subterms that must be evaluated to provide a solution to the head vertex . The arrows denote derivations with back-pointers, and they show the first (dashed blue), second (dashed orange), third (solid blue) and fourth (solid orange) best derivations of the head vertex, and which derivations of the tail vertices were used to achieve them
Mapping recurrence to a hypergraph. An illustration representing a particular recurrence term in the hypergraph. Each hyperarc encodes a set of subterms that must be evaluated to provide a solution to the head vertex . The arrows denote derivations with back-pointers, and they show the first (dashed blue), second (dashed orange), third (solid blue) and fourth (solid orange) best derivations of the head vertex, and which derivations of the tail vertices were used to achieve them
3.3 Network history inference as optimal derivation
The network history inference problem can be encoded as an instance of the optimal derivation problem as follows. We set the hypervertices of the hypergraph H to be
Node in H represents whether there is an interaction between proteins u and v just before either of the proteins duplicate. We exclude from any hypervertices involving proteins that cannot have an interaction between them because one is an ancestor of the other or because they are in different species.Let u and u denote the left and right children of node u. For every hypernode where u and v are not leaves, we have the following hyperarcs:
The hyperarcs aforementioned encode the option of recursing into either the children of u or the children of v and the option of losing the u–v interaction [Equations (4) and (6)] or not losing it [Equations (3) and (5)]. The cost of hyperarcs (4) and (6) is the cost of a loss event, and the cost of hyperarcs (3) and (5) is 0. The analogous hyperarcs exist for head nodes of the form , with and switched. Finally, for those hypervertices where (representing potential homodimer interactions), the incoming hyperarcs are slightly different. Specifically, denoting present as p and absent as a, a hypervertex appears as the head of the following hyperarcs:
The hyperarc in (7) encodes the recurrence where a homodimer interaction for protein u is inherited by its progeny, implying the edges and . The hyperarc in (8) encodes the recurrence in which the homodimer interaction is lost before u’s duplication. Just as with Equations (3–6), the analogous hyperarcs exist for hypervertices of the form with and switched.If u or v is a leaf, we omit the hyperarcs above that would involve u or v’s non-existent children. If both u and v are leaves, we add the trivial hyperarcs and with an empty tail. In this case, the cost of is the cost of a loss event if edge exists in the observed present-day networks and 0 otherwise; the cost of is the cost of an interaction gain if present-day edge exists. We can also assign equal, non-zero costs to leaf nodes to designate that the state of an interaction is unknown rather than present or absent.
Theorem 2
Let
be the set of hyperarcs used in an optimal derivation of the hypergraph defined earlier in the text. Let set
contain a gain event corresponding to every hyperarc in
that transitioned from
to
and let
further contain a loss event corresponding to every hyperarc that transitioned from
to
. Then
is the lowest-cost solution to the network history inference problem (Problem 1), except that
may contain temporally inconsistent events.We omit the proof of Theorem 2 because of space, but it follows directly from the proof in Patro , translated into the hypergraph framework. The issue of allowing temporally inconsistent events is apparently what makes Problem 1 difficult. Here, we hope to mitigate the effect of temporally inconsistent solutions by summing over many near-optimal solutions.In the rest of this article, we will refer to solutions having minimum cost as optimal, regardless of their inclusion of temporally inconsistent events. Thus, when we say a solution is optimal, we mean that it has the absolute minimum cost with regard to the parsimony criteria of any history generating the extant interactions. This is justified as, in practice, such temporally inconsistent optimal solutions seem to be rare (Patro ). When we say that a solution is near-optimal, we mean that it is optimal or it has a cost close to that of an optimal solution; it need not be temporally consistent.All of the extensions to the recurrence and cost function described in Patro can be encoded in the hypergraph framework, including directed edges, asymmetric interaction gain and loss costs and weighted branch length costs.
3.4 Counting optimal and near-optimal solutions
There may be many near-optimal derivations representing different network histories. We would like to use all these histories to compute probabilities for particular events (e.g. interaction events or an order of duplication events) to have occurred. First, we show how to compute the number of derivations of various costs.Let be the set of the jth-best derivations rooted at hypervertex x. That is, is the set of optimal derivations, and is the set of non-optimal derivations with cost as close to optimal as possible. We call a cost class, and let denote the cost of each derivation in .We want to accumulate the sizes of the top-k cost classes of the root of the hypergraph. This will give us a distribution of the costs of near-optimal derivations; thus, a distribution of costs of near-optimal network histories. That is, we would like to compute for for some k. In general, this constitutes many more than the top-k individual solutions because there are many ways to obtain different solutions of equivalent cost. The key to developing an efficient algorithm for this task is to realize that we can count all derivations belonging to the top-k cost classes of a vertex without enumerating them.Every derivation D in is built up from a choice of hyperarc combined with (potentially near-optimal) choices of derivations of each of the members of the tail of that hyperarc. Derivation D thus includes some subderivations , where d is the index of the cost class used in the subderivation for t for derivation D. However, the size does not depend on the specific choices of but only their cost classes d. Specifically, a particular choice of hyperarc and of a set of leads to
possible derivations of the same cost . Let represent a vector of choices of cost classes [e.g. ]. Then the size of a cost class can be expressed recursively by combining Equation (9) with
Unfortunately, implementing the aforementioned sum directly would be computationally expensive, as it involves summing over many choices of . However, we can exploit the fact that derivations in cost class j + 1 are related to derivations in cost class j in the following way. Denote by the vector having a 1 in its th position and a 0 everywhere else. Then, we define the neighborhood of a pair to be the set
. In other words, is the set of choices for cost classes for the subderivations that use the same cost classes as except for one item in the tail of e, for which the next higher cost class is used. We then have the following lemma.
Lemma 3
Let
be a derivation that falls in cost class
. Then any derivation in
is in
.Again, for space, we omit a full proof, but the lemma is intuitive: to go up one cost class you should only change the cost class used for one of the subderivations. This is the essential observation behind so-called cube pruning and cube growing approaches (Gesmundo and Henderson, 2010; Huang and Chiang, 2005) for enumerating k best derivations.Lemma 3 implies that we can efficiently enumerate the top-k cost classes for a vertex x by maintaining a priority queue of the potential best derivations that allows us to walk from the optimal class with to higher cost classes. The priority queue is initially populated with . When a derivation is removed from the queue, its neighbors are added to the priority queue sorted by their cost, and this process continues until all derivations have been exhausted or until the top-k cost classes have been enumerated. Lemma 3 guarantees that all items in cost class j + 1 will be processed after those in cost class j. Algorithm 1 formalizes this process. The process can be made even more efficient using the faster cube pruning approach introduced by Gesmundo and Henderson (2010).To compute for all hypervertices x, we process hypervertices in topological order starting from the leaves and moving up the hypergraph. Each leaf has only two derivations. The cost classes for non-leaf hypervertices can be computed via Algorithm 1. As the cost function is monotonically increasing within each edge, to obtain the top- k cost classes at a vertex x, it will always be sufficient to have computed the top- k cost classes for all of x’s preceding vertices. This algorithm is similar to Algorithm 2 from Huang and Chiang (2005), except that cost classes of derivations with equivalent costs using different hyperarcs are merged. A simple example of this algorithm is illustrated in Supplementary Figure S4
3.5 Estimating probabilities of network history events
We now describe an algorithm that can use the counts derived via the algorithm in Section 3.4 to estimate probabilities for network history events (i.e. the interaction state or relative duplication order of ancestral proteins) based on how often they occur in the ensemble of near-optimal solutions. At a high level, the algorithm distributes a probability mass at hypervertices in accordance with how frequently they appear in near-optimal solutions.
Assigning weights to cost classes.
Events that occur in derivations in low-cost classes are intuitively more believable than those that occur in very high-cost classes. We must decide the relative weight placed on these classes. If there is only a single cost class, all of the weight is assigned to the solutions from the class. Otherwise, a cost class of cost is assigned weight using following equation involving a user-provided parameter γ:
where is a normalizing constant, s ranges over the costs of all cost classes associated with vertex x, and and are shorthand for the minimum and maximum costs for the computed cost classes of x. When γ is large, cost classes are given near-equal weight. At low γ, high-cost classes count for little.
Assigning probabilities to hyperarcs.
Algorithm 2 traverses H in reverse topological order starting from the root. For every hyperarc , it computes a probability that is equal to the sum of the fractions of time that this arc was used in each cost class, with each cost class weighted according to function w aforementioned. Specifically, let
be the conditional probability that a derivation of hypervertex x will use hyperarc given that the derivation is of cost . gives the number of times e was used in a derivation in . Then, the total probability of hyperarc is given as
where the sum runs over cost classes at x. Therefore, the probability mass contributed to hyperarc e by cost class is the weight of this cost class times the conditional probability that e was used in a derivation in .
Assigning probabilities to hypervertices.
The probability assigned to a hypervertex x is the sum, over all hyperarcs e where , of the probability of the hypervertex times . That is, for every hyperarc e with x appearing in its tail, the probability mass deposited at x by e is the total probability of the head (say, y) of this hyperarc times the probability that e is used in a near-optimal derivation of y. In actuality, two probabilities, an ‘in’ and ‘out’ probability are computed for each vertex. This is described in greater detail in Supplementary Section S2.
3.6 Predicting interactions
As the hypergraph encodes as its vertices all potential protein interactions—both extant and ancestral—the task of predicting scores for such interactions is straightforward. After running Algorithm 2, to determine a probability for edge existing, we simply look at the ‘out’ probability assigned to hypervertex . Note that the pair may not be considered in all potential histories, as different relative duplication orders may lead to histories in which u and v never co-exist. However, if one assumes that u and v co-exist, one can condition the relevant probabilities based on that assumption and compute the conditioned probability , where . Interestingly, one can use these same probabilities to compute a probability, according to the ensemble of parsimonious histories, which a pair of proteins actually co-existed.Probabilities are also computed for extant pairs of proteins. One way to view these scores is as a phylogenetic smoothing of the input networks. This suggests that we may use the output scores of potential interactions to identify specific interactions that we would or would not expect to see given the duplication histories and the rest of the observed interactions.To predict potential extant interactions, we consider pairs of proteins with no interaction in the input data but between which the probability output by Algorithm 2 is relatively high. For example, if interlogs exist for the potential edge in evolutionarily close species, then we expect that the derivations for a reasonable fraction of parsimonious and near-parsimonious histories will rely on the present interaction state between these two proteins (even though, taken in isolation, the present state will have a higher cost than the absent state). Thus, the algorithm is using all of the information encoded by the input interactions and the protein phylogeny to jointly determine the probability with which we expect to observe a given edge.
3.7 Estimating relative duplication order
We can also use the ensemble of parsimonious histories encoded by our method to compute a probability for the relative duplication order of a pair of ancestral proteins u and v. Let
where denotes the set of ancestors of a protein. P is simply the sum of probabilities that the children of u existed before the children of itself, or any ancestor of v. P is defined analogously, swapping the roles of u and v. Then P represents the sum of probabilities over parsimonious histories that u duplicated before v, whereas P represents the probability, in our ensemble, that v duplicated before u. Thus, to predict the relative duplication order of u and v, we can simply compare the probabilities P and P and predict that the protein having the larger of the two probabilities was the first to duplicate.
3.8 Data and testing methodology
3.8.1 bZIP transcription factors
To evaluate the ancestral network reconstruction task, we use the bZIP family of proteins. Similar tests were first performed by Pinney , who produced these data. The bZIP transcription factors make an enticing set of data on which to test methods for ancestral network reconstruction because the interactions between these transcription factors are strongly mediated by their coiled-coil leucine zipper domains, and the strength of these interactions can be computationally predicted with high sensitivity and specificity using sequence alone (Fong ). This means that the interaction affinity of ancestral proteins can be estimated with reasonably high confidence by first estimating the ancestral sequence and then performing a sequence-based prediction of the interaction affinity between the ancestral protein sequences. This sequence-based method was used to predict the interaction strength between both extant and inferred ancestral bZIP proteins sequences. The predicted affinities among present-day proteins were used to generate the extant interactions. Affinities among ancestral proteins were taken as the ‘ground truth’ ancestral interactions (Pinney ).We experiment with three different variations on these data. The original data consist of interaction scores as predicted by the software of Fong . This software computes a score for each pair of proteins, which predicts the affinity of their potential interaction. Higher scores are assigned to pairs of proteins for which the model predicts a greater propensity for a strong interaction between these proteins. Present-day interactions were created between those pairs for which the interaction score is ≥30.6 (the score for which the probability of an interaction existing given the score is 0.5) (Pinney ). To create two noisy versions of the data, Gaussian noise with mean 0 and standard deviations of 10 and 20 was added to the original scores [which were in the range ], which were then converted to binary interactions as in the original dataset.
3.8.2 Herpes viruses
Protein interaction data. We use the whole proteome interaction networks of five different herpes viruses experimentally determined by Fossum . The viruses are the Epstein–Barr virus, herpes simplex virus 1 (HSV-1), murine cytomegalovirus (mCMV), Kaposi’s sarcoma-associated herpesvirus (KSHV) and the varicella-zoster virus. Together, the viruses span the α, β and γ herpesvirus subfamilies and represent a sampling of viruses, which have diverged substantially, as the speciation of their common ancestor ∼400 M years ago (McGeoch and Gatherer, 2005; McGeoch ). Despite this divergence, there is still a set of core orthologs that are present in all of the species.Gene and species trees. We use the species tree representing the relationships between the five herpes virus species given by (McGeoch and Gatherer, 2005; McGeoch ). For each of the proteins in the core orthology groups assigned by Fossum , we obtained the sequences from the UniProt database (The UniProt Consortium, 2012). We then constructed gene trees for each of the orthology groups using PyCogent (Knight ). Finally, the gene trees were rooted and reconciled with the species tree using the Notung 2 software (Durand ; Vernot ).Leave-one-out cross-validation on pairs of orthology groups. Given the reconciled gene trees for each core orthology group and the high-confidence interactions reported by Fossum , we perform our cross-validation experiments as follows. Let O denote the set of core orthology groups, and for each pair of groups in , let I denote the set of interactions within and between groups a and b. For each pair of orthology groups where , we remove each interaction i in I in turn, while leaving the remaining interactions fixed. This yields a problem instance consisting of the reconciled trees T and T for orthology groups a and b, and the set of interactions . We run SOPH on this instance, and we record the score assigned to each potential interaction. We rank the potential interactions according to their probabilities, and we report the relative rank of i, the left-out interaction, among the list of potential, non-input interactions. In other words, let L and L denote the leaf nodes of T and T (not considering nodes marked as lost by the reconciliation algorithm) and . Then, we consider all potential interactions , where , and sort them in descending order according to their assigned scores. The requirement that enforces that is only a potential interaction if u and v belong to the same species. We compute the relative rank of i in this list as . Ideally, the relative rank should be low, indicating that the left-out edge was near the top of the list of predicted interactions. The relative rank is always in the range of 0–1 (inclusive), and if the ranks were assigned randomly, we would expect the left-out interaction to have relative rank of 0.5 on average (this property holds empirically).
4 RESULTS
We now describe the performance of the SOPH framework for the three inference tasks—ancestral network reconstruction, missing interaction imputation and determination of relative protein duplication order—set forth in Section 1.
4.1 Reconstructing bZIP ancestral networks in the presence of noise
For each of the noise levels of the input data (), we reconstruct three ancestral networks—Teleost (ancestor of Danio rerio and Takifugu rubripes), vertebrata (ancestor of D.rerio, T.rubripes and Homo sapiens) and chordate (ancestor of D.rerio, T.rubripes, H.sapiens and Ciona intestinalis). For all experiments, we use the top k = 40 cost classes and set γ, the parameter that determines the relative weight of the different cost classes to .To measure the quality of the ancestral network reconstruction, we use three separate metrics, the BEDROC score (Truchon and Bayly, 2007), the area under the ROC (AUROC) and the area under the precision-recall curve (AUPR). The BEDROC metric is an AUC metric meant to deal with the so-called early enrichment or early recognition problem. Intuitively, the BEDROC metric weights the accuracy more heavily early on in the retrieval list. This is appropriate for the task of inferring ancestral interactions because we expect the density of such interactions to be relatively low, and because we care most about those inferred ancestral interactions in which we have high confidence. When computing BEDROC scores, we set the early-recall parameter α to 20.0 as suggested by Truchon and Bayly (2007).Table 1 demonstrates the performance of our ancestral network reconstruction procedure compared with the single-history parsimony approach (Patro ) and the probabilistic model used by Pinney . We find that our method often outperforms both other methods under the three metrics shown in Table 1.
Table 1.
Ancestral network reconstruction accuracy of several methods under various levels of noise σ
Ancestor
Method
BEDROC ()
AUROC
AUPR
Vertebrata
SOPH
Parsimony
Probabilistic
Teleost
SOPH
Parsimony
Probabilistic
Chordata
SOPH
Parsimony
Probabilistic
Note: The performance of our SOPH approach, a single-history parsimony approach (Patro ) and the probabilistic method described by Pinney et al. in reconstructing the ancestral interaction networks we consider.
Ancestral network reconstruction accuracy of several methods under various levels of noise σNote: The performance of our SOPH approach, a single-history parsimony approach (Patro ) and the probabilistic method described by Pinney et al. in reconstructing the ancestral interaction networks we consider.These results show the potential benefit of using the SOPH approach to the ancestral network reconstruction problem, especially in the typical situation where the error rates of measured present-day interactions can be very high (Stumpf ). More generally, the results demonstrate that the probabilistic method, although clearly more robust to noise than the naïve parsimony approach, is not inherently superior in this aspect to advanced methods based on parsimony. In particular, the results of the SOPH approach with noisy input interactions suggests that a method based on analyzing an ensemble of parsimonious solutions can exceed the accuracy of methods based on maximum likelihood. By exploring all near-optimal parsimonious histories, SOPH is able to overcome one of the main shortcomings of previous parsimony-based approaches and to provide substantially better performance, in most cases, than any of the pre-existing methods.We note that the cases in which the maximum-likelihood approach is most competitive with SOPH is in the most ancient ancestral species. However, this is also the species in which we have the least confidence in the ground-truth data, as ground-truth ancestral interactions were computed based on the interaction scores of inferred ancestral sequences.We cannot perform a similarly exhaustive validation of the ancestral predictions for the herpes virus networks because we do not have a general scheme for determining ground-truth ancestral interactions. Anecdotally, however, we note that the second highest probability ancestral interaction predicted by our method in the common ancestor of all five herpes virus species was between the orthology groups containing HSV-1 proteins UL33 and UL31. The interactions between these orthology groups were posited by Fossum to be highly conserved in their study of evolutionarily conserved protein interaction in the herpes networks; suggesting that this interaction likely did exist in the ancestral network.
We also test the accuracy of SOPH for predicting missing extant interactions in the present-day bZIP networks. Let L denote the set of leaves (i.e. extant proteins) in T, and let I be the set of ground-truth interactions among L. We define as the universe of potential interactions.We performed leave-one-out (LOO) and 10- and 5-fold cross-validation (CV) to test the accuracy of our imputations. In LOOCV, the mean relative rank of the left-out interaction is 0.05 and the median relative rank is 0.01. We observe a minor decrease in performance, with mean ranks of 0.06 and 0.08 and median ranks of 0.01 and 0.02, when simulating lower data coverage with 10- and 5-fold cross-validation. Alternatively, the edges predicted by the RWS (Lei and Ruan, 2013) method have higher mean relative ranks of 0.12, 0.14 and 0.18 and median relative ranks of and 0.1 on LOO, 10-fold and 5-fold cross-validation tests, respectively.The histogram of relative ranks among all experiments (10-fold CV results; Fig. 2) displays a highly skewed distribution for both methods, but SOPH clearly assigns relative ranks closer to 0 (the optimum) for most edges. In fact, with the SOPH predictions, the vast majority of the testing edges appear in the top 2% of the potential edges, and the frequency of lower probabilities for the true left-out edge falls off exponentially. This suggests that our algorithm is able to identify missing present-day edges with high accuracy.
Fig. 2.
A histogram of the relative ranks of the left-out edges in 10-fold cross-validation experiments on the bZIP network
A histogram of the relative ranks of the left-out edges in 10-fold cross-validation experiments on the bZIP network
4.3 Imputing present-day interactions in herpes viruses
We also applied our network history inference framework to predict missing edges in herpes viruses. Unfortunately, the core orthology groups are small enough, and the interactions between and within them sparse enough, that the testing methodology precludes anything other than leave-one-out cross-validation (described in Section 3.8). We test the relative ranks of the SOPH predictions, as well as those of RWS and a variant thereof (RWS) where predicted self-loops are removed in a post-processing step. Although our method only considers the interactions within and between each pair of orthology groups in isolation, RWS is provided with the entire core interaction network for each test. The distributions of relative ranks for the three different methods are shown in Figure 3.
Fig. 3.
Histogram of the ranks of the left-out edge in cross-validation experiments for the herpes virus networks. RWS denotes the standard RWS method, which always predicts homodimer interactions. As the core herpes network has few such interactions, the resulting predictions are poor. For RWS, all homodimer predictions produced by RWS were set to 0. This extra information substantially improves the RWS predictions; however, such information is usually not known when performing an edge imputation task. SOPH, on the other hand, can effectively predict homodimer interactions on a protein-by-protein basis
Histogram of the ranks of the left-out edge in cross-validation experiments for the herpes virus networks. RWS denotes the standard RWS method, which always predicts homodimer interactions. As the core herpes network has few such interactions, the resulting predictions are poor. For RWS, all homodimer predictions produced by RWS were set to 0. This extra information substantially improves the RWS predictions; however, such information is usually not known when performing an edge imputation task. SOPH, on the other hand, can effectively predict homodimer interactions on a protein-by-protein basisAcross all homology groups, the relative ranks computed by SOPH for the left-out interactions are substantially lower than we would expect by chance, with a mean relative rank of 0.23, and median rank of 0.16, indicating that the left-out edge is nearly always in the top 25% of the possible edges. The unmodified RWS predictions obtain mean relative rank of 0.78 and median of 0.79. This is primarily because of the fact that RWS always predicts the existence of a homodimer interaction. In certain protein families with a high homodimerization rate (like bZIP), such predictions are often accurate. However, in the herpes networks, where homodimers are rare, these predictions are problematic for RWS. Thus, we also tested a variant of the RWS predictions (RWS) where all predicted homodimer scores were set to 0 in a post-processing step. This results in a mean relative rank of 0.14 and a median relative rank of 0; a huge improvement in performance over the unmodified RWS predictions. Setting any homodimer scores to 0 also improves the SOPH predictions, but not as drastically. However, such information about the homodimerization rate is not often known a priori, and one strategy is not always better than the other (e.g. RWS outperforms RWS in bZIP). Thus, although RWS either predicts the existence of all or no homodimer interactions, SOPH can predict them effectively on a protein-by-protein basis.Some of the cases in which the experimentally deleted edge is given a low probability (by either method) are likely because of interactions, which are surprising from an evolutionary perspective, or simply a result of the sparsity of the input dataset. In particular, as the experimental dataset used to perform these tests is hypothesized to have a relatively high–false-negative rate itself (Fossum ), it is likely the case that the evolutionary evidence to improve the prediction of edges is simply missing.
Performance on individual pairs of orthology groups.
Figure 4 provides a heatmap of the average relative rank of imputed interactions between pairs of orthology groups. Because of the sparsity of the initial data and the presumed low density of the true interaction networks, many pairs of orthology groups contain one or zero interactions between them (they appear white in Fig. 4) and are left-out of the experiment. Among the remaining groups, we notice a somewhat bimodal distribution of relative ranks. Between many pairs of groups, the missing interactions can be perfectly imputed (relative rank of 0), whereas between others, the task seems incredibly difficult (e.g. between groups 4 and 24 the average relative rank of the left-out edge was 0.63). Again, this suggests that when there is sufficient evolutionary evidence, missing interactions can be imputed with high accuracy. Because we do not have a true gold-standard set of interactions, we cannot reliably hypothesize whether the imputed interactions with large relative ranks are because of a failure of the method (i.e. evolutionarily non-parsimonious interactions) or simply false-negatives in the input data.
Fig. 4.
Imputing missing interactions (per-group). A heatmap of the average relative ranks of the left-out edge between pairs of orthology groups
Imputing missing interactions (per-group). A heatmap of the average relative ranks of the left-out edge between pairs of orthology groups
4.4 Inferring relative order of duplications
Because probabilities for several kinds of evolutionary events can be extracted from the ensembles of histories, the method can also be used to estimate the relative duplication order of ancestral proteins as described in Section 3.7. Accurate independent measurements of duplication order can help validate branch lengths and also estimate the relative ordering of speciation events.We will compare our inferred duplication order with the duplication order implied by an ultrametric embedding of the bZIP phylogeny. The branch lengths inferred on the original bZIP tree [constructed via PAML (Yang, 1997)] do not satisfy the ultrametric property; thus, they are not consistent with a molecular clock and cannot be directly used to infer duplication order. By embedding the given branch lengths into an ultrametric tree using the method provided in Huerta-Cepas , we obtain consistent relative orderings based on sequence information alone.To measure the agreement between the orders inferred by SOPH and those inferred by sequence, we compute the standard Kendall statistic among all intra-species pairs of gene duplication events that were not related by direct evolution. Let N be the total number of tested pairs, n be the number of such pairs that the two methods place in the same relative ordering, n be the number of pairs for which they disagree, t be the number of tied pairs given the tree ordering and t be the number of tied pairs given the SOPH ordering. Then
.Among all 5194 relevant pairs, there are 3745 concordant and 1349 discordant pairs, leading to . This correlation is highly significant, with a P-value of , using the analytic estimation of variance suggested in Hazewinkel (2000). When performing the aforementioned test, we did not supply SOPH with the branch lengths, and the method did not use any information about the relative duplication order or protein sequences apart from the ancestral relationships encoded in the tree topologies themselves. This relatively good performance means that there is a substantial amount of information about the relative duplication order of proteins encoded in the network. The relatively high and low P-value indicate that the SOPH approach is able to reconstruct, in a largely independent way, the relative order of duplication events.We note that if we use the duplication order implied by the non-ultrametric version of the bZIP phylogeny—where the branch lengths still encode evolutionary information but cannot be interpreted directly as representing evolutionary time—we obtain a of 0.20 (3109 concordant and 2084 discordant pairs) and an associated P-value of 0.002. This suggests that SOPH can be useful in providing a separate and not-often considered source of information (extant interaction networks) when attempting to determine a consistent set of branch lengths and duplication orders.
5 CONCLUSION
We have introduced a novel sum-over-histories method for solving the network history inference problem. It addresses shortcomings of existing methods by using a weighted ensemble consisting of all optimal and near-optimal parsimonious histories. We show that this makes the results robust to the presence of noise in the input (Section 4.1) and allows our parsimony approach to outperform the probabilistic approach to ancestral network reconstruction (Pinney ) at all considered noise levels.The algorithms we present have practical running times. Our implementation required only 1.5 min to compute—in serial—the results for all cross-validation experiments on the herpes virus datasets (an average of <1 s per experiment). On the significantly larger bZIP dataset, the algorithm requires 6.5, 8.5, 10.4, 12.4, 14.4, 16.9 and 34.4 s, respectively, to compute solutions using the top 1, 10, 20, 30, 40, 50 and 100 cost classes, suggesting an empirically linear relationship between the running time and the number of requested cost classes.The sum-over-histories approach is also general and allows many other questions to be answered about how a sequence of proteins, and their interactions have evolved. We find that our method can reliably exploit evolutionary evidence to discover the existence of missing interactions. SOPH may be useful in prioritizing low-throughput but high-accuracy protein interaction experiments by suggesting which interactions are more likely than others to exist given the current experimental and evolutionary evidence. The SOPH approach also recovers the ultrametric temporal ordering relationships between duplication events well, without using any direct information about sequence or branch lengths.
Authors: John W Pinney; Grigoris D Amoutzias; Magnus Rattray; David L Robertson Journal: Proc Natl Acad Sci U S A Date: 2007-12-12 Impact factor: 11.205
Authors: Rob Knight; Peter Maxwell; Amanda Birmingham; Jason Carnes; J Gregory Caporaso; Brett C Easton; Michael Eaton; Micah Hamady; Helen Lindsay; Zongzhi Liu; Catherine Lozupone; Daniel McDonald; Michael Robeson; Raymond Sammut; Sandra Smit; Matthew J Wakefield; Jeremy Widmann; Shandy Wikman; Stephanie Wilson; Hua Ying; Gavin A Huttley Journal: Genome Biol Date: 2007 Impact factor: 13.583