Making the most of high-throughput protein-interaction data.

Abstract

We review the estimation of coverage and error rate in high-throughput protein-protein interaction datasets and argue that reports of the low quality of such data are to a substantial extent based on misinterpretations. Probabilistic statistical models and methods can be used to estimate properties of interest and to make the best use of the available data.

Substances：
Proteins

Year: 2007 PMID： 18001486 PMCID： PMC2246275 DOI： 10.1186/gb-2007-8-10-112

Source DB: PubMed Journal: Genome Biol ISSN： 1474-7596 Impact factor: 13.583

Most protein functions involve their interaction with other molecules, often with other proteins in the assembly of operational complexes. A better understanding of protein interactions is fundamental to the study of biological systems. As many drugs act on proteins, it is also a prerequisite for understanding intended, and unintended, drug effects. Over the past few years a number of large-scale experiments have set out to map protein interactions systematically [1-15]. While there is interest in combining the resulting data, there appear to be substantial discrepancies between experiments, and evaluation studies have reported large error rates, lack of overlap and apparent contradictions between the different datasets [16-21]. The purpose of this article is to critically assess the methodology used to analyze protein-interaction datasets. When interpreting individual experiments or combining datasets from different experiments, we need to consider three questions: first, what do we want to know and which experiments provide data that can be used to answer our questions; second, which types of protein interactions were assayed and under what conditions; and third, what types of measurement errors may have occurred and what is their prevalence. In this article we will discuss how the formulation of appropriate statistical models can allow investigators to clearly identify and estimate quantities of interest. We will consider two particular types of protein interactions: physical interactions, and interactions between members of a protein complex - which we shall call 'complex membership interactions'. A physical interaction is a direct and specific contact between a pair of proteins [22]. We regard two proteins in a complex as having a physical interaction if they share an interaction surface. A complex membership interaction exists between proteins that are part of the same multiprotein complex and does not necessarily imply a physical interaction.

Sampling and coverage

The two most widely used experimental techniques for detecting protein-protein interactions are the yeast two-hybrid (Y2H) system [23] and affinity purification followed by mass spectrometry (AP-MS) [24]. The Y2H system assays whether proteins can physically interact with each other. Large-scale experiments are carried out in a colony-array format, in which each yeast colony expresses a defined pair of 'bait' and 'prey' proteins that can be scored for reporter gene activity - indicating interaction - in an automated manner [1,6,25]. The type of information obtained from a Y2H experiment is shown in Figure 1. In an AP-MS experiment, a tagged protein is expressed in yeast and then 'pulled down' from a cell extract, along with any proteins associated with it, by co-immunoprecipitation or by tandem affinity purification. The set of pulled-down proteins is identified by MS. In a laborious and expensive process, this procedure has been systematically applied to large sets of yeast proteins [7-11]. The tagged protein in AP-MS is also sometimes called the bait and the proteins it pulls down the prey. The information on protein complexes given by Y2H and AP-MS experiments is compared in Figure 2.

Figure 1

Interpreting results on direct physical interactions from Y2H experiments. (a) The observation of interactions A-B and B-C in a Y2H experiment does not indicate whether the two interactions can take place simultaneously (center) or whether they are exclusive of each other (right). (b) The ability of two proteins to interact may depend on post-translational modifications whose presence or absence may be actively regulated. Proteins D and E interact (center) in the absence of a certain post-translational modification (red shape), whose presence inhibits the interaction (right).

Figure 2

The manifestation of protein complexes in Y2H and AP-MS data. AP-MS experiments measure complex co-membership, and the fact that a prey is found by a certain bait means that there is either a direct physical interaction or an indirect physical interaction mediated by a protein complex. The set of proteins pulled down by a particular bait cannot therefore be equated with a single complex: if the bait is part of several different complexes, then the set of prey will be the union of all proteins in all complexes. (a) Protein B is involved in three different multiprotein complexes. In two of these it directly interacts with C, which itself can also interact with proteins F, G or H, whereas in the third complex, B interacts with D and E. (b) Assuming there are no other interactions under the conditions of the experiment, the bipartite graph between proteins B, ..., H and complexes 1, 2, and 3 will look like this. (c,d) The result of a hypothetical AP-MS experiment with no false positives and no false negatives when (c) B is used as a bait and (d) F is used as a bait. (e,f) Result from a hypothetical Y2H experiment with a genome-wide set of preys and with no false positives and no false negatives when (e) B is used as a bait and (f) F is used as a bait. (g,h) The results of (g) an ideal AP-MS experiment and (h) an ideal Y2H experiment if all proteins were used as baits. The Y2H data in (e,f,h) identify the direct interactions, but they do not contain information on the number and architecture of the complexes. The maximal cliques identified by the AP-MS experiment in (g) correspond to the complexes in (a). However, the AP-MS data do not contain information on the topology of the direct interactions within each complex.

An appreciation of the concepts of sampling and coverage is vital for interpreting the data from these types of experiments [26,27]. The term 'sampling' is used for experimental designs where only a subset of the population is interrogated. Representative sampling techniques are used in many fields of science, but they are not common in the generation of protein-interaction datasets, where sampling has often been guided by biological priorities. The 'coverage' summarizes which part of the total set of possible interactions has actually been tested. Even when genome-wide screening was intended [1,10,11], coverage was in fact well below 100%, and the success for each bait seems to depend on nonrandom biological, technological and economic factors. For example, Gavin et al. [10] used all 6,466 open reading frames (ORFs) that were at that time annotated in the Saccharomyces cerevisiae genome and obtained tandem affinity purifications for 1,993 of those. The remaining 4,473 (69%) failed at various stages, because, for example, the tagged protein failed to express or protein bands were not well separated by gel electrophoresis. Thus, neither the set of tested baits nor the set of tested prey in current experiments is a random subset of all proteins in the organism, and in general it is not valid to make inferences about the 'population', that is, the set of all physical interactions that take place in a cell under the conditions being studied, by assuming the available experimental data from a Y2H or AP-MS experiment to be a representative sample.
We are not arguing that random sampling be used, as it would not be appropriate in this setting, but rather that the data need to be interpreted more judiciously. One problem in evaluating large-scale protein-interaction experiments is that the published data are often not sufficiently detailed to allow accurate description of the sets of baits and prey that were actually tested. As a proxy, we introduced the concept of 'viable baits' and 'viable prey' [28]. The first is the set of baits that were reported to have interacted with at least one prey, and the latter are those proteins reported to be found by at least one bait. Numbers for these can be unambiguously obtained from the reported data and provide surrogate measures for the tested baits and tested prey. The set of all pairs between viable bait and viable prey are the interactions that we are confident were experimentally tested and could, in principle, have been detected. The failure to detect an interaction between a viable bait and a viable prey is informative, whereas the absence of an observed interaction between an untested bait and prey is not. We note that the set of viable prey is a subset of the tested prey, and viable baits are a subset of the tested baits. This approach might introduce bias, because negative data from baits that were tested but found no prey, as well as from prey that were present but did not interact with any bait, are not recorded. On the other hand, presuming that combinations were tested, when in fact they were not, can also result in bias. Gilchrist et al. [29] used a randomization approach to estimate the size of the prey populations for the datasets in [7] and [8]. Their estimates are about double those of the number of viable prey.

Representation as graphs

Graph theory offers a convenient and useful set of terms and concepts to represent relationships between entities. Graphs most commonly represent binary relationships and these can be either directed or undirected. A further type of graph is needed to represent the membership of proteins in complexes: this relationship is not binary and requires a type of graph called a bipartite graph. Box 1 gives precise definitions of these concepts and an overview of how they apply to protein-interaction data.

Box 1

Undirected graphs are often used as a model for physical interactions. True relationships are symmetric: if protein A interacts with B, then B interacts with A. The observed experimental data, however, often display asymmetry, which is a consequence of the experimental asymmetry between bait and prey. Protein A may identify protein B as an interactor when A is used as a bait, but B, when used as a bait, may not find A. To represent asymmetric data, we suggest using a directed-graph model. This is a point on which we diverge from much of the current practice. We argue that although the quantity of interest is an unknown undirected graph, it must be estimated from the observed data, which should be represented as a directed graph. "All models are wrong, but some are useful." This maxim of George Box [30] reminds us that we should not expect these models to adequately represent all possible aspects of protein interactions in a satisfactory way. For the current types of data and questions, graph models are useful. As the data and the questions that we ask become more sophisticated, more complicated models are likely to be needed. Some limitations of the graph models described here are related to their lack of resolution in time and space, failure to distinguish between different protein isoforms or post-translational modifications, and to the fact that experiments do not record interactions between individual protein molecules but between populations. It is the lack of such information that makes it difficult to use Y2H data to make inferences about the composition of protein complexes (see Figure 1) or to use AP-MS data to identify the physical interactions of the proteins within a complex and their stoichiometry (see Figure 2).

Error statistics

Whether two proteins physically interact in vivo is not always simple to determine: the range of binding affinities of biologically relevant protein interactions spans many orders of magnitude [31], and interactions can be dynamic, transient and highly regulated. Nevertheless, the simple measurement model used to interpret the results of protein-interaction experiments presumes that for each pair of proteins, the question of whether or not they interact can be answered as either yes or no. The aim of making a measurement is to record the true, typically unknown, value of a physical quantity, but in practice there will be deviations - measurement errors. In such circumstances, statistical methods can be used to infer the true value of a quantity, given the data and some assumptions about how the measurement tool works. In this sense, the Y2H system or an AP-MS screen are simply measurement tools that provide imperfect data from which we make inferences about the true state of nature. Standard definitions of various error statistics [32] are given in Box 2. We give them to enable a coherent dialog and to address some of the confusion in the literature. For example, a widely cited evaluation study by Edwards et al. [17] reported a "false positive rate" defined as FP/(TP + FP), where FP is the number of false positives and TP the number of true positives. However, the more common name for this quantity is the 'false-discovery rate' (see Box 2). The difference between the false-positive rate, as usually defined by FP/N, and the false-discovery rate can be substantial, as their denominators are very different: N is the number of tested pairs that truly do not interact, given by TN + FP (see Box 2). Incompatible terminology leads to confusion and makes comparison of error rates reported in different studies difficult.
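A small numerical sketch shows how far apart the two quantities can be. The confusion-matrix counts below are hypothetical and chosen only to mimic a sparse interaction graph in which true non-interactions vastly outnumber true interactions. The function name is likewise an assumption made for illustration.

```python
def error_rates(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard error statistics computed from confusion-matrix counts."""
    return {
        "false_positive_rate": fp / (fp + tn),   # FP / N, with N = TN + FP
        "false_discovery_rate": fp / (fp + tp),  # FP / (TP + FP)
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Hypothetical counts: 100 true interactions among 100,100 tested pairs.
rates = error_rates(tp=50, fp=50, tn=99_950, fn=50)
for name, value in rates.items():
    print(f"{name}: {value:.4f}")
```

With these assumed counts, half of the reported interactions are wrong (false-discovery rate 0.5) even though only 0.05% of the true non-interactions were misreported (false-positive rate 0.0005).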

Box 2

Measurement errors can be decomposed into two components: stochastic and systematic errors. Stochastic errors are associated with random variability, whereas systematic errors are recurrent. Stochastic errors are simpler to address: they can be controlled by replication, can be eventually eliminated if the experiment is repeated many times, and they can often readily be described using probability models. Systematic errors give rise to bias: the quantity being measured is consistently different from the truth. Their identification is difficult, but if it can be done, they can be addressed either by improving the experimental procedures or by developing appropriate methods for post-experiment data processing.

Statistical models for the analysis of protein-interaction data

Statistical models can integrate the information from repeated or related measurements and quantify the (un)certainty that we have about the conclusions. Here we consider how statistical techniques have been applied to two distinct problems: estimating membership of a protein complex and the integration of data from different experiments (cross-experiment integration of data).

Estimating membership of a protein complex

Russell and colleagues [10] have developed a heuristic that they term the 'socioaffinity index', A. It quantifies the confidence that proteins i and j share complex membership, given a set of protein purifications each with its bait and a number of prey. The score is the logarithm of the product of three odds-ratios. The first odds-ratio compares the frequency with which bait i pulled down prey j to the frequency that would be expected if prey came down randomly; the second is the corresponding value for bait j pulling down prey i; and the third is the ratio of frequency of co-occurrence of i and j in a pull-down to what would be expected under random sampling. The authors then apply a customized clustering algorithm to the matrix A to estimate sets of protein complexes from AP-MS data. Scholtens and colleagues took a different route [33,34]. They explicitly modeled the underlying bipartite graph of membership of proteins in protein complexes. They estimated the bipartite graph from the observed data using a penalized likelihood method. Their method explicitly differentiates between tested and untested edges in the data, and it deals with the possibility that some proteins can be members of multiple complexes and others may not be assignable to any.

Cross-experiment integration of data

Turning to the issue of the cross-experiment integration of data, Gilchrist and colleagues [29] described a statistical model for identifying stochastic errors in protein-protein interaction datasets that is based on the Binomial distribution. They assumed that there is a true underlying graph of protein interactions in the biological system under study and that multiple experimental runs are performed, each resulting in a set of observed edges. A true edge is observed with probability 1 - pFN and missed with the false-negative probability pFN. Similarly, a true non-edge is observed as an edge with false-positive probability pFP and not observed with probability 1 - pFP. They assumed that all these stochastic events are independent of each other, and governed only by the two Binomial rates pFP and pFN. The statistical distribution of the number of observed edges S between two proteins, given n trials, and conditional on whether or not they truly interact, is then simply given by Binomial distributions:

$$P(S = s \mid \text{true edge}) = \binom{n}{s} \, (1 - p_{\mathrm{FN}})^{s} \, p_{\mathrm{FN}}^{\,n-s} \qquad (1)$$

$$P(S = s \mid \text{true non-edge}) = \binom{n}{s} \, p_{\mathrm{FP}}^{\,s} \, (1 - p_{\mathrm{FP}})^{n-s} \qquad (2)$$

From this, the authors constructed a maximum likelihood estimator of pFP and pFN, and a likelihood-ratio test to decide, for any pair of proteins, whether the data suggest an interaction between them. Krogan and colleagues [11,35] took an approach that is similar in spirit to that of Gilchrist et al. [29]. Their formulation uses a Bayes factor that compares the probability of the observed data under the two possible alternatives, and a further component that represents the prior odds of an interaction. The use of a Bayes factor in this context is entirely appropriate, but given that the selection of baits is typically not a simple random sample from the population of potential baits, it is somewhat difficult to interpret the role of the prior, and it seems some justification is needed. The two approaches [29,35] differ somewhat in how specific quantities, such as pFP and pFN, are estimated. 
An important difference is that Krogan and colleagues [35] were specifically interested in combining AP-MS datasets to solve the problem of identifying protein complexes.
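The Binomial measurement model behind Equations (1) and (2) can be sketched as follows. This is a minimal illustration, not the estimator of Gilchrist et al. [29]: the error rates are taken as given rather than estimated by maximum likelihood, the numerical values are hypothetical, and the function names are assumptions.

```python
from math import comb, log

def binom_pmf(s: int, n: int, p: float) -> float:
    """Probability of s successes in n independent trials with success rate p."""
    return comb(n, s) * p**s * (1 - p) ** (n - s)

def log_likelihood_ratio(s: int, n: int, p_fp: float, p_fn: float) -> float:
    """log[ P(S=s | true edge) / P(S=s | true non-edge) ].

    A true edge is seen with probability 1 - p_fn in each trial,
    a true non-edge with probability p_fp.
    Both rates must lie strictly between 0 and 1.
    """
    p_edge = binom_pmf(s, n, 1 - p_fn)
    p_non_edge = binom_pmf(s, n, p_fp)
    return log(p_edge / p_non_edge)

# Hypothetical rates and design: 4 trials per protein pair.
P_FP, P_FN, N_TRIALS = 0.01, 0.5, 4
for s in range(N_TRIALS + 1):
    llr = log_likelihood_ratio(s, N_TRIALS, P_FP, P_FN)
    print(f"observed {s}/{N_TRIALS} times: log-likelihood ratio = {llr:+.2f}")
```

Positive values favor a true interaction and negative values favor a true non-interaction. Under the assumed rates, a pair that is never observed in four trials is more likely a non-edge, whereas two observations already favor an edge strongly.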

Internal error rate estimation using reciprocity

The direction of an observed bait-prey interaction is informative for the estimation of error rates and the identification of systematic errors. If two proteins A and B are each tested both as bait and prey, then ideally we expect reciprocity in their interaction data: if they truly interact, bait A should find prey B and bait B should find prey A. If they truly do not interact, there should be no observed interaction in either direction. In real data there will be many pairs of proteins for which reciprocity does not hold, and these cases imply that either a false-positive or a false-negative measurement was made. Examining the prevalence of reciprocated interactions among the reciprocally tested edges can tell us something about error rates, both stochastic and systematic. As the set of reciprocally tested edges is usually not explicitly recorded, we have used the concept of viable baits and viable prey to produce Table 1, which gives the numbers of viable bait and prey proteins, and based on this, the numbers of reciprocated and unreciprocated interaction measurements for several large-scale Y2H and AP-MS experiments. We can represent these data for each experiment as a directed subgraph GBP, with nodes being the intersection of viable baits and viable prey, and with directed edges each representing an observed interaction of a bait with a prey. There are several experiments in which GBP is sufficiently large for statistical analysis, and the reciprocity criterion can be used to measure the internal consistency of a dataset [28].
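The construction of the subgraph GBP and the counting of reciprocated and unreciprocated edges can be sketched in a few lines. The input data are hypothetical, the function name is an assumption, and the counting convention (each reciprocated pair counted once, self-interactions ignored) is a choice made here for illustration; it is not necessarily the convention behind the REC and UNR columns of Table 1.

```python
from typing import Iterable, Set, Tuple

def reciprocity_counts(observed: Iterable[Tuple[str, str]]):
    """Return (nodes of G_BP, reciprocated pairs, unreciprocated directed edges)."""
    edges = set(observed)
    viable_baits = {b for b, _ in edges}
    viable_prey = {p for _, p in edges}
    nodes: Set[str] = viable_baits & viable_prey  # seen both as bait and as prey

    # Restrict to directed edges between nodes of G_BP, ignoring self-loops.
    sub = {(b, p) for b, p in edges if b in nodes and p in nodes and b != p}

    reciprocated = {frozenset(e) for e in sub if (e[1], e[0]) in sub}
    unreciprocated = {e for e in sub if (e[1], e[0]) not in sub}
    return nodes, len(reciprocated), len(unreciprocated)

# Hypothetical directed observations (bait, prey).
observed = [("A", "B"), ("B", "A"), ("A", "C"), ("C", "D"), ("B", "C")]
nodes, n_rec, n_unr = reciprocity_counts(observed)
print(sorted(nodes), n_rec, n_unr)
```

In the toy data, D is never a viable bait, so the edge from C to D lies outside GBP and is counted neither as reciprocated nor as unreciprocated.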

Table 1

Overview of seven Y2H and five AP-MS experiments

Reference	VB	CB	TB	VP	VBP	VBP/VB	TI	TI/VB	REC	UNR
Ito et al. [1]	1,522		6,604	2,493	773	0.51	4,524	3.0	75	803
Cagney et al. [2]	19		31	40	11	0.58	54	2.9	3	4
Tong et al. [3]	20		22	59	5	0.25	115	5.8	1	1
Hazbun et al. [4]	66		100	1,940	28	0.42	2,524	38	4	13
Zhao et al. [5]	1		1	90	0	0.00	90	90	0	0
Uetz et al. Experiment 1 [6]	508		6,604	630	142	0.28	952	1.9	10	47
Uetz et al. Experiment 2 [6]	139		192	400	36	0.26	524	3.8	18	7
Gavin et al. [7]	455	600	725	1,179	271	0.60	3,419	7.5	192	314
Ho et al. [8]	493	589	1,739	1,316	231	0.47	3,687	7.5	69	297
Krogan et al. [9]	153	165	165	483	151	0.99	1,132	7.4	89	157
Gavin et al. [10]	1,752	1,993	6,466	1,790	991	0.57	19,105	10.9	1,077	4,297
Krogan et al. [11]	2,264	2,357	4,562	5,323	2,226	0.98	63,360	28.0	1,969	34,363

VB, the number of viable baits; CB, the number of cloned (hybridized) baits, if available; TB, the total number of baits that the experimenters were initially aiming at; VP, the number of viable prey; VBP, the number of proteins observed as both bait and prey; TI, the total number of interactions observed; REC, the number of reciprocated interactions between proteins that were observed as both bait and prey; UNR, the number of unreciprocated interactions between proteins that were observed as both bait and prey. Not all of the experiments were genome-wide - some were focused on particular aspects of the cellular machinery [2-5,9]. Even in the so-called genome-wide studies [1,6-8,10,11], however, the viable baits cover only around a third of the yeast genes. This means that the largest part of interaction space by far, containing interactions between proteins not used as baits, was not sampled in any of these experiments. We can also see that TI/VB, the average number of interactions per viable bait, varies markedly between experiments. In the more focused studies, this will certainly be a result of different criteria for the selection of baits. In the genome-wide screens it may indicate the application of different, experiment-specific cutoffs.

To identify proteins that are likely to be subject to systematic experimental error, we can compare their in-edges and out-edges (see Box 1) within the directed subgraph GBP. Ideally, these edges should all reciprocate each other; if a certain protein has very many unreciprocated edges, this indicates that it is likely to be affected by a systematic error. To quantify this, the number of unreciprocated edges, nunr, originating from or pointing to a particular protein can be compared with the number of reciprocated edges that it has and to the false-positive and false-negative rates pFP and pFN. 
Precise estimation of these rates is difficult, however, and a simple and effective criterion can instead be derived from considering symmetry. For a given number of unreciprocated edges, nunr, if there are no systematic errors then the unreciprocated edges should be in-edges and out-edges in approximately equal numbers. If we denote their numbers by nin and nout, respectively, then nin + nout = nunr, and we expect that

$$n_{\mathrm{in}} \sim \mathrm{Binomial}\left(n_{\mathrm{unr}}, \tfrac{1}{2}\right) \qquad (3)$$

If nin and nout are significantly different from each other according to this Binomial distribution, we would conclude that the protein behaved differently in the experiment when used as bait compared with prey, and would use this as an indication of systematic error affecting at least part of the data for that protein. An application of this criterion to the subgraph GBP of the data of Krogan et al. [11] is shown in Figure 3.
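The symmetry criterion of Equation (3) amounts to an exact Binomial test with success probability 1/2. The sketch below is one possible implementation; the function name, the two-sided form of the test and the example counts are assumptions, and the threshold of 10^-4 is borrowed from the contours drawn in Figure 3.

```python
from math import comb

def symmetry_p_value(n_in: int, n_out: int) -> float:
    """Two-sided exact p-value for n_in under Binomial(n_in + n_out, 1/2)."""
    n_unr = n_in + n_out
    if n_unr == 0:
        return 1.0  # no unreciprocated edges, nothing to test
    k = min(n_in, n_out)
    # Lower-tail probability of seeing k or fewer edges in one direction.
    tail = sum(comb(n_unr, i) for i in range(k + 1)) / 2**n_unr
    # The distribution is symmetric for p = 1/2, so double the tail.
    return min(1.0, 2 * tail)

# Hypothetical proteins: (unreciprocated in-edges, unreciprocated out-edges).
examples = {"balanced": (5, 5), "mild": (3, 9), "bait-only": (0, 20)}
for name, (n_in, n_out) in examples.items():
    p = symmetry_p_value(n_in, n_out)
    flag = "possible systematic error" if p < 1e-4 else "consistent with symmetry"
    print(f"{name}: n_in={n_in}, n_out={n_out}, p={p:.2e} -> {flag}")
```

A protein flagged in this way behaved differently as bait and as prey, but the test alone does not say which direction is wrong.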

Figure 3

Scatterplot of nin and nout for the AP-MS data of Krogan et al. [11]. Each point in the plot corresponds to one protein. nin is the number of times that the protein was found as a prey; nout the number of prey it found when used as a bait. The two lines mark contours of probability p = 10^-4 according to the Binomial model in Equation (3). Outlying proteins (dark blue) show a significantly large difference between nin and nout, suggesting that at least one of them is wrong. For example, if nout >> nin, one possible reason is that a protein is not expressed when used as prey or of such low abundance that it is outcompeted, but when tagged and expressed as a bait, it will identify and pull down its interaction partners as prey. Further validation experiments are needed to determine in each case whether the unreciprocated interactions correspond to false-positive or false-negative observations.

Estimation of the properties of the interaction graph in this setting

There are two basic approaches to estimation: one is to estimate the true underlying graph, given the data and some modeling assumptions, then to calculate properties of interest from the estimated graph. The other is to directly estimate the quantities of interest without making an attempt to estimate the true underlying graph. For protein-interaction data we suggest that the latter is often preferable, as it can deal better with the low coverage of the datasets. As new methods and models for integrating datasets are developed it will be important to reassess the situation. We distinguish between two different types of quantities to be estimated. The first type are single numeric values, such as degree, clustering coefficient or diameter. The second are more general structures, such as modules or subgraphs. The tools for estimation are more developed for numeric quantities than for modules, and there is agreement on the definitions of the different quantities. For modules, or cohesive subgroups, there is little agreement on what is being sought or how to find it.

The integration of data from different independent experiments

No single experiment has provided complete information on all interactions in a system of interest and so data from different experiments need to be integrated. Integration promises to increase coverage and reduce the effects of stochastic errors. Table 1 summarizes experiments done on the yeast protein interactome that are candidates for integration. The overlap between experiments is examined in Tables 2 and 3.

Table 2

Pairwise comparison of Y2H datasets

References	Ito et al. [1]	Cagney et al. [2]	Tong et al. [3]	Hazbun et al. [4]	Zhao et al. [5]	Uetz et al. [6] Experiment 1	Uetz et al. [6] Experiment 2
[1]	-	9	7	24	1	224	47
[2]	28	-	0	0	0	7	3
[3]	34	0	-	0	0	4	7
[4]	856	14	25	-	0	15	12
[5]	43	1	2	38	-	0	0
[6] Experiment 1	388	14	22	272	15	-	36
[6] Experiment 2	200	9	26	204	13	108	-

The values above the diagonal give the number of viable baits in common between each pair of experiments, and the values below the diagonal give the number of viable prey in common. We see that the overlap between experiments in the sampled fractions of protein-interaction space is in all cases very small, given that thousands of interactions were assayed.

Table 3

Pairwise comparison of AP-MS datasets

References	Gavin et al. [7]	Ho et al. [8]	Krogan et al. [9]	Gavin et al. [10]	Krogan et al. [11]
[7]	-	82	51	442	334
[8]	516	-	25	222	286
[9]	299	246	-	121	151
[10]	1,143	717	371	-	1,128
[11]	1,149	1,277	478	1,732	-

As in Table 2 the values above the diagonal give the number of viable baits in common between each pair of experiments, and the values below the diagonal give the number of viable prey in common. Again, the overlap is very small. Consider the two largest experiments carried out so far: with a set of 2,264 viable baits and 5,323 viable prey, Krogan et al. [11] tested for the presence of at least 12 million complex membership interactions. Gavin et al. [10], with 1,752 viable baits and 1,790 viable prey, tested for at least 3.1 million interactions. However, even for these two datasets, the largest so far, the known overlap is only 1,128 × 1,732 ≈ 2.0 million. One of the possible explanations for these low estimates of coverage and overlap is that our definitions of viable baits and viable prey are restrictive and that indeed a much larger space of interactions might have been tested. For example, Gilchrist et al. [29] estimated a value about twice ours for the number of tested prey in [7]. This situation will hopefully be alleviated as researchers report more complete data on which interactions were actually tested.
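The sizes of the tested interaction spaces quoted here follow directly from the counts of viable baits and viable prey in Tables 1 and 3. The short sketch below reproduces the arithmetic; the variable names are assumptions, and the tested space is approximated as viable baits times viable prey, as in the text.

```python
# Counts taken from Table 1 (viable baits, viable prey) and Table 3 (shared counts).
krogan_baits, krogan_prey = 2264, 5323   # Krogan et al. [11]
gavin_baits, gavin_prey = 1752, 1790     # Gavin et al. [10]
shared_baits, shared_prey = 1128, 1732   # in common between [10] and [11]

krogan_space = krogan_baits * krogan_prey
gavin_space = gavin_baits * gavin_prey
overlap_space = shared_baits * shared_prey

print(f"Krogan et al. tested space: {krogan_space:,}")
print(f"Gavin et al. tested space:  {gavin_space:,}")
print(f"Known overlap:              {overlap_space:,}")
print(f"Overlap as a fraction of the Krogan et al. space: {overlap_space / krogan_space:.2f}")
```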

An essential step before integration of data is to assess their quality in terms of specificity, sensitivity and coverage. Such an assessment should provide reliable estimates of the false-positive and false-negative error rates. There are three main computational approaches: comparison to a benchmark or 'gold standard' data, within-experiment or internal validation, and between-experiment validation. 
When direct physical interactions are being measured (for example, by Y2H), crystal structures of the interacting proteins can be used as the gold standard for the validity of the interaction. This was one of the approaches used in [17]. Only a handful of crystal structures of interacting proteins are known, however, and such data are still difficult and expensive to obtain. Some physical interactions and protein complexes have also been characterized through detailed biochemical investigations, and are collected in databases such as MIPS [36] and GO [37]. Circularity needs to be avoided, however; for example, the data from [7] and [9] are now reported as known complexes in some of the public protein complex databases.

Within-experiment validation relies on internal properties of the data, such as redundancies or symmetries that are not used in the experiment and that can therefore be used to validate the experimental results. One such property is reciprocity, as discussed above. Deviations from expectation can be used to estimate stochastic error rates, and they can also be used to identify individual proteins whose data appear to be subject to systematic artifacts (see Figure 3).

Reported replicate measurements can also be used to help validate experimental data and to estimate error rates. The basic idea is that if edges are tested multiple times under the same conditions, those that are found frequently can be termed true positives, and the occasions on which they were missed can be used to estimate the false-negative rate. Similarly, those that are seldom found can be deemed true negatives, and the occasions on which they were nevertheless detected can be used to estimate the false-positive rate. This approach is complicated by possible dependencies between the replicate measurements and by systematic errors that, if present, will affect all replicates. These complications may render the statistical model intractable. Further caution is warranted.
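Before turning to those caveats, the replicate-based idea can be made concrete with a minimal sketch. It assumes each edge was tested the same way several times and that replicates are independent, which is exactly the assumption the text warns may fail. The thresholds `hi` and `lo`, the function name and the toy data are illustrative choices of ours, not part of any published protocol.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (bait, prey)


def replicate_error_rates(
    calls: Dict[Edge, List[bool]],
    hi: float = 0.75,
    lo: float = 0.25,
) -> Tuple[float, float]:
    """Return (false_negative_rate, false_positive_rate) from replicate calls.

    Edges detected in at least `hi` of their replicates are treated as true
    interactions, and their misses are counted as false negatives. Edges
    detected in at most `lo` of their replicates are treated as true
    non-interactions, and their detections are counted as false positives.
    Edges in between are ambiguous and ignored.
    """
    misses = tests_on_positives = 0
    hits = tests_on_negatives = 0
    for replicates in calls.values():
        if not replicates:
            continue  # nothing was measured for this edge
        detected = sum(replicates)
        fraction = detected / len(replicates)
        if fraction >= hi:
            misses += len(replicates) - detected
            tests_on_positives += len(replicates)
        elif fraction <= lo:
            hits += detected
            tests_on_negatives += len(replicates)
    fnr = misses / tests_on_positives if tests_on_positives else float("nan")
    fpr = hits / tests_on_negatives if tests_on_negatives else float("nan")
    return fnr, fpr


# Toy data: four replicate tests per edge (made up for illustration).
toy_calls = {
    ("A", "B"): [True, True, True, False],
    ("A", "C"): [True, True, True, True],
    ("B", "D"): [False, False, False, True],
    ("C", "D"): [False, False, False, False],
    ("A", "D"): [True, False, True, False],  # ambiguous, ignored
}
fnr, fpr = replicate_error_rates(toy_calls)
print(f"estimated false-negative rate: {fnr:.3f}")
print(f"estimated false-positive rate: {fpr:.3f}")
```

Note that the same calls are used both to label edges and to count errors, so these estimates are optimistic; a probabilistic model fitted to all replicates would avoid the hard thresholds.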
Was the choice of replicate measurements made a priori, or because of anomalous results obtained during the experiment? Do they provide equal coverage of all important conditions and of all types of proteins that were studied?

Between-experiment comparisons rely on the experimental conditions being sufficiently similar to ensure that the measurements are made on the same underlying set of true interactions. However, as we see in Tables 2 and 3, in many cases there is relatively little overlap in bait selection and in observed prey. For two recent experiments with at least some overlap, a comparison was presented in [20]. These authors found a moderate overlap between the primary data (for example, the proteins identified by each successful bait) but a low overlap between the protein complexes computed by each group.

When integrating data from different experiments, our recommendation is that validation against a gold standard and within-experiment validation should first be done on each experiment separately. Once the data are sufficiently well understood and as many of the systematic errors as possible have been resolved, integration becomes worthwhile. If there is little agreement on the existence of interactions for edges tested in different experiments, then one must question the prudence of their integration: it may be that the biological conditions were too different to allow their integration into a single meaningful dataset. There is room for much more research here.

Evidence in favor of, or against, experimentally detected interactions can often be obtained from other sources, such as data from other organisms, dependencies of different types of interactions on each other (for example, coexpression, co-localization and physical interaction), evolutionary conservation [38], protein structure [39] and amino-acid binding motifs [40]. The challenge is to ensure that the evidence is applicable and that it bears a relationship to the assay and system under study.
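The agreement check described above only makes sense on edges that both experiments actually tested. The following sketch restricts the comparison to the jointly tested space (shared viable baits × shared viable prey) before measuring agreement. The `Experiment` container, the use of the Jaccard index as the agreement measure, and the toy protein names are all assumptions made for illustration.

```python
from typing import NamedTuple, Set, Tuple

Edge = Tuple[str, str]  # (bait, prey)


class Experiment(NamedTuple):
    baits: Set[str]      # viable baits
    prey: Set[str]       # viable prey
    observed: Set[Edge]  # bait-prey pairs reported as interacting


def jointly_tested(a: Experiment, b: Experiment) -> Set[Edge]:
    """Bait-prey pairs that both experiments were able to assay."""
    shared_baits = a.baits & b.baits
    shared_prey = a.prey & b.prey
    return {(bait, prey) for bait in shared_baits for prey in shared_prey}


def agreement(a: Experiment, b: Experiment) -> Tuple[int, int, float]:
    """Return (#jointly tested pairs, #pairs seen by both, Jaccard index)."""
    tested = jointly_tested(a, b)
    seen_a = a.observed & tested
    seen_b = b.observed & tested
    either = seen_a | seen_b
    both = seen_a & seen_b
    jaccard = len(both) / len(either) if either else float("nan")
    return len(tested), len(both), jaccard


# Toy experiments (made-up proteins P1..P6).
exp1 = Experiment(
    baits={"P1", "P2", "P3"},
    prey={"P2", "P3", "P4", "P5"},
    observed={("P1", "P2"), ("P1", "P4"), ("P2", "P3"), ("P3", "P5")},
)
exp2 = Experiment(
    baits={"P1", "P2", "P6"},
    prey={"P2", "P3", "P4"},
    observed={("P1", "P2"), ("P2", "P3"), ("P2", "P4"), ("P6", "P3")},
)
n_tested, n_both, jaccard = agreement(exp1, exp2)
print(f"{n_tested} jointly tested pairs, {n_both} seen by both, Jaccard = {jaccard:.2f}")
```

Edges outside the jointly tested space, such as ("P3", "P5") here, are excluded rather than counted as disagreements, since one experiment never tested them.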
Our purpose in writing this article was to address the observation that the many different protein-interaction datasets available appear to have very little in common, and also to address reports that the data are inherently noisy and of low quality (for example, [17,41]). Our investigations suggest that the data themselves, while problematic in some cases, are not the real issue. Rather, the data are often misinterpreted, the methods used to address noisiness are often inadequate, and the lack of substantive comparisons between methods applied to the data has led to a situation where the data, rather than the methods, are treated with suspicion. As seen from Tables 2 and 3, low coverage, and not the false-positive rate, is responsible for the small amount of overlap between datasets.

The separation of errors into stochastic and systematic components is potentially of great benefit. Comparison of experimental data should be based on stochastic error rates. The identification of systematic errors can help to identify problems with the experimental techniques and, we hope, suggest solutions to those problems. We believe that when more standard, and sound, statistical practices are adopted for preprocessing the data, it will be possible to estimate quantities of interest and to make substantial comparisons. An essential prerequisite is the adoption of standard methods for estimating stochastic error rates and, where possible, for identifying systematic errors. Standardized preprocessing is also required in order to be able to synthesize different experimental datasets. Combining data requires attention to the differing error rates, and the discounting of information from more variable experiments. Given the numbers in Tables 2 and 3, there is much to be gained by combining the different experimental datasets.
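One simple way to see what "discounting information from more variable experiments" could mean in practice is a naive Bayes combination of detections, in which each experiment contributes a likelihood ratio determined by its own error rates. This is our own illustrative choice, not a method prescribed in this article. It assumes errors are independent across experiments (so it ignores systematic errors), and the prior and error rates in the example are invented.

```python
from typing import Iterable, NamedTuple


class Observation(NamedTuple):
    detected: bool  # did this experiment report the edge?
    fpr: float      # P(detected | no true interaction)
    fnr: float      # P(not detected | true interaction)


def posterior_probability(prior: float, observations: Iterable[Observation]) -> float:
    """Posterior probability that an edge is real, given independent tests."""
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    odds = prior / (1.0 - prior)
    for obs in observations:
        if not (0.0 < obs.fpr < 1.0 and 0.0 < obs.fnr < 1.0):
            raise ValueError("error rates must lie strictly between 0 and 1")
        if obs.detected:
            odds *= (1.0 - obs.fnr) / obs.fpr  # likelihood ratio of a detection
        else:
            odds *= obs.fnr / (1.0 - obs.fpr)  # likelihood ratio of a miss
    return odds / (1.0 + odds)


# Hypothetical experiments: equal sensitivity, very different false-positive rates.
precise = Observation(detected=True, fpr=0.01, fnr=0.5)
noisy = Observation(detected=True, fpr=0.20, fnr=0.5)

prior = 0.01  # assumed prior probability that a tested pair truly interacts
p_precise = posterior_probability(prior, [precise])
p_noisy = posterior_probability(prior, [noisy])
p_both = posterior_probability(prior, [precise, noisy])
print(f"precise only: {p_precise:.3f}, noisy only: {p_noisy:.3f}, both: {p_both:.3f}")
```

A detection from the noisier experiment moves the posterior far less than one from the more precise experiment, which is the discounting referred to above. Pairs that an experiment never tested should simply contribute no observation.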
We believe that the data, while noisy, are in fact very useful, and with appropriate preprocessing and statistical modeling they can provide deep insight into the functioning of cellular machineries.

39 in total

1. Structure-based assembly of protein complexes in yeast.

Authors: Patrick Aloy; Bettina Böttcher; Hugo Ceulemans; Christina Leutwein; Christian Mellwig; Susanne Fischer; Anne-Claude Gavin; Peer Bork; Giulio Superti-Furga; Luis Serrano; Robert B Russell
Journal: Science Date: 2004-03-26 Impact factor: 47.728

2. Local modeling of global interactome networks.

Authors: Denise Scholtens; Marc Vidal; Robert Gentleman
Journal: Bioinformatics Date: 2005-07-05 Impact factor: 6.937

3. Functional organization of the yeast proteome by systematic analysis of protein complexes.

Authors: Anne-Claude Gavin; Markus Bösche; Roland Krause; Paola Grandi; Martina Marzioch; Andreas Bauer; Jörg Schultz; Jens M Rick; Anne-Marie Michon; Cristina-Maria Cruciat; Marita Remor; Christian Höfert; Malgorzata Schelder; Miro Brajenovic; Heinz Ruffner; Alejandro Merino; Karin Klein; Manuela Hudak; David Dickson; Tatjana Rudi; Volker Gnau; Angela Bauch; Sonja Bastuck; Bettina Huhse; Christina Leutwein; Marie-Anne Heurtier; Richard R Copley; Angela Edelmann; Erich Querfurth; Vladimir Rybin; Gerard Drewes; Manfred Raida; Tewis Bouwmeester; Peer Bork; Bertrand Seraphin; Bernhard Kuster; Gitte Neubauer; Giulio Superti-Furga
Journal: Nature Date: 2002-01-10 Impact factor: 49.962

4. A novel genetic system to detect protein-protein interactions.

Authors: S Fields; O Song
Journal: Nature Date: 1989-07-20 Impact factor: 49.962

5. A comprehensive two-hybrid analysis to explore the yeast protein interactome.

Authors: T Ito; T Chiba; R Ozawa; M Yoshida; M Hattori; Y Sakaki
Journal: Proc Natl Acad Sci U S A Date: 2001-03-13 Impact factor: 11.205

6. High-definition macromolecular composition of yeast RNA-processing complexes.

Authors: Nevan J Krogan; Wen-Tao Peng; Gerard Cagney; Mark D Robinson; Robin Haw; Gouqing Zhong; Xinghua Guo; Xin Zhang; Veronica Canadien; Dawn P Richards; Bryan K Beattie; Atanas Lalev; Wen Zhang; Armaity P Davierwala; Sanie Mnaimneh; Andrei Starostine; Aaron P Tikuisis; Jorg Grigull; Nira Datta; James E Bray; Timothy R Hughes; Andrew Emili; Jack F Greenblatt
Journal: Mol Cell Date: 2004-01-30 Impact factor: 17.970

7. A map of the interactome network of the metazoan C. elegans.

Authors: Siming Li; Christopher M Armstrong; Nicolas Bertin; Hui Ge; Stuart Milstein; Mike Boxem; Pierre-Olivier Vidalain; Jing-Dong J Han; Alban Chesneau; Tong Hao; Debra S Goldberg; Ning Li; Monica Martinez; Jean-François Rual; Philippe Lamesch; Lai Xu; Muneesh Tewari; Sharyl L Wong; Lan V Zhang; Gabriel F Berriz; Laurent Jacotot; Philippe Vaglio; Jérôme Reboul; Tomoko Hirozane-Kishikawa; Qianru Li; Harrison W Gabel; Ahmed Elewa; Bridget Baumgartner; Debra J Rose; Haiyuan Yu; Stephanie Bosak; Reynaldo Sequerra; Andrew Fraser; Susan E Mango; William M Saxton; Susan Strome; Sander Van Den Heuvel; Fabio Piano; Jean Vandenhaute; Claude Sardet; Mark Gerstein; Lynn Doucette-Stamm; Kristin C Gunsalus; J Wade Harper; Michael E Cusick; Frederick P Roth; David E Hill; Marc Vidal
Journal: Science Date: 2004-01-02 Impact factor: 47.728

8. Global landscape of protein complexes in the yeast Saccharomyces cerevisiae.

Authors: Nevan J Krogan; Gerard Cagney; Haiyuan Yu; Gouqing Zhong; Xinghua Guo; Alexandr Ignatchenko; Joyce Li; Shuye Pu; Nira Datta; Aaron P Tikuisis; Thanuja Punna; José M Peregrín-Alvarez; Michael Shales; Xin Zhang; Michael Davey; Mark D Robinson; Alberto Paccanaro; James E Bray; Anthony Sheung; Bryan Beattie; Dawn P Richards; Veronica Canadien; Atanas Lalev; Frank Mena; Peter Wong; Andrei Starostine; Myra M Canete; James Vlasblom; Samuel Wu; Chris Orsi; Sean R Collins; Shamanta Chandran; Robin Haw; Jennifer J Rilstone; Kiran Gandi; Natalie J Thompson; Gabe Musso; Peter St Onge; Shaun Ghanny; Mandy H Y Lam; Gareth Butland; Amin M Altaf-Ul; Shigehiko Kanaya; Ali Shilatifard; Erin O'Shea; Jonathan S Weissman; C James Ingles; Timothy R Hughes; John Parkinson; Mark Gerstein; Shoshana J Wodak; Andrew Emili; Jack F Greenblatt
Journal: Nature Date: 2006-03-22 Impact factor: 49.962

9. Assigning function to yeast proteins by integration of technologies.

Authors: Tony R Hazbun; Lars Malmström; Scott Anderson; Beth J Graczyk; Bethany Fox; Michael Riffle; Bryan A Sundin; J Derringer Aranda; W Hayes McDonald; Chun-Hwei Chiu; Brian E Snydsman; Phillip Bradley; Eric G D Muller; Stanley Fields; David Baker; John R Yates; Trisha N Davis
Journal: Mol Cell Date: 2003-12 Impact factor: 17.970

10. How complete are current yeast and human protein-interaction networks?

Authors: G Traver Hart; Arun K Ramani; Edward M Marcotte
Journal: Genome Biol Date: 2006 Impact factor: 13.583
