Robert Gentleman, Wolfgang Huber.
Abstract
We review the estimation of coverage and error rate in high-throughput protein-protein interaction datasets and argue that reports of the low quality of such data are to a substantial extent based on misinterpretations. Probabilistic statistical models and methods can be used to estimate properties of interest and to make the best use of the available data.
Year: 2007 PMID: 18001486 PMCID: PMC2246275 DOI: 10.1186/gb-2007-8-10-112
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Figure 1. Interpreting results on direct physical interactions from Y2H experiments. (a) The observation of interactions A-B and B-C in a Y2H experiment does not indicate whether the two interactions can take place simultaneously (center) or whether they are exclusive of each other (right). (b) The ability of two proteins to interact may depend on post-translational modifications whose presence or absence may be actively regulated. Proteins D and E interact (center) in the absence of a certain post-translational modification (red shape), whose presence inhibits the interaction (right).
Figure 2. The manifestation of protein complexes in Y2H and AP-MS data. AP-MS experiments measure complex co-membership, and the fact that a prey is found by a certain bait means that there is either a direct physical interaction or an indirect physical interaction mediated by a protein complex. The set of proteins pulled down by a particular bait therefore cannot be equated with a single complex: if the bait is part of several different complexes, then the set of prey will be the union of all proteins in all of those complexes. (a) Protein B is involved in three different multiprotein complexes. In two of these it directly interacts with C, which itself can also interact with proteins F, G or H, whereas in the third complex, B interacts with D and E. (b) Assuming there are no other interactions under the conditions of the experiment, the bipartite graph between proteins B, ..., H and complexes 1, 2, and 3 will look like this. (c,d) The result of a hypothetical AP-MS experiment with no false positives and no false negatives when (c) B is used as a bait and (d) F is used as a bait. (e,f) The result of a hypothetical Y2H experiment with a genome-wide set of prey and with no false positives and no false negatives when (e) B is used as a bait and (f) F is used as a bait. (g,h) The results of (g) an ideal AP-MS experiment and (h) an ideal Y2H experiment if all proteins were used as baits. The Y2H data in (e,f,h) identify the direct interactions, but they do not contain information on the number and architecture of the complexes. The maximal cliques identified by the AP-MS experiment in (g) correspond to the complexes in (a). However, the AP-MS data do not contain information on the topology of the direct interactions within each complex.
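The point that an ideal AP-MS bait recovers the union of all complexes it belongs to can be illustrated with a small sketch. The complex memberships below are hypothetical: the caption does not say which of F, G and H sits in which complex, so the assignment is chosen only for illustration, and the helper name `apms_prey` is likewise an assumption.

```python
# Hypothetical complex membership (illustration only; the exact assignment
# of F, G and H to complexes 1 and 2 is not given in the caption).
complexes = {
    1: {"B", "C", "F"},
    2: {"B", "C", "G", "H"},
    3: {"B", "D", "E"},
}

def apms_prey(bait, complexes):
    """Ideal AP-MS (no false positives or negatives): the bait pulls down
    every co-member of every complex that contains it."""
    prey = set()
    for members in complexes.values():
        if bait in members:
            prey |= members
    prey.discard(bait)  # the bait itself is not counted as prey
    return prey

# B belongs to all three complexes, so its prey set is their union.
print(sorted(apms_prey("B", complexes)))
# F belongs to one complex only, so its prey set is that complex.
print(sorted(apms_prey("F", complexes)))
```

With B as bait the prey set mixes members of all three complexes, which is why a single pull-down cannot be read as a single complex. Direct (Y2H-type) interactions would need an explicit edge list and are not modeled in this sketch.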
Box 1
Box 2
Table 1. Overview of seven Y2H and five AP-MS experiments
| Reference | VB | CB | TB | VP | VBP | VBP/VB | TI | TI/VB | REC | UNR |
| Ito | 1,522 | | 6,604 | 2,493 | 773 | 0.51 | 4,524 | 3.0 | 75 | 803 |
| Cagney | 19 | | 31 | 40 | 11 | 0.58 | 54 | 2.9 | 3 | 4 |
| Tong | 20 | | 22 | 59 | 5 | 0.25 | 115 | 5.8 | 1 | 1 |
| Hazbun | 66 | | 100 | 1,940 | 28 | 0.42 | 2,524 | 38 | 4 | 13 |
| Zhao | 1 | | 1 | 90 | 0 | 0.00 | 90 | 90 | 0 | 0 |
| Uetz | 508 | | 6,604 | 630 | 142 | 0.28 | 952 | 1.9 | 10 | 47 |
| Uetz | 139 | | 192 | 400 | 36 | 0.26 | 524 | 3.8 | 18 | 7 |
| Gavin | 455 | 600 | 725 | 1,179 | 271 | 0.60 | 3,419 | 7.5 | 192 | 314 |
| Ho | 493 | 589 | 1,739 | 1,316 | 231 | 0.47 | 3,687 | 7.5 | 69 | 297 |
| Krogan | 153 | 165 | 165 | 483 | 151 | 0.99 | 1,132 | 7.4 | 89 | 157 |
| Gavin | 1,752 | 1,993 | 6,466 | 1,790 | 991 | 0.57 | 19,105 | 10.9 | 1,077 | 4,297 |
| Krogan | 2,264 | 2,357 | 4,562 | 5,323 | 2,226 | 0.98 | 63,360 | 28.0 | 1,969 | 34,363 |
VB, the number of viable baits; CB, the number of cloned (hybridized) baits, if available; TB, the total number of baits that the experimenters were initially aiming at; VP, the number of viable prey; VBP, the number of proteins observed as both bait and prey; TI, the total number of interactions observed; REC, the number of reciprocated interactions between proteins that were observed as both bait and prey; UNR, the number of unreciprocated interactions between proteins that were observed as both bait and prey. Not all of the experiments were genome-wide - some were focused on particular aspects of the cellular machinery [2-5,9]. Even in the so-called genome-wide studies [1,6-8,10,11], however, the viable baits cover only around a third of the yeast genes. This means that the largest part of interaction space by far, containing interactions between proteins not used as baits, was not sampled in any of these experiments. We can also see that TI/VB, the average number of interactions per viable bait, varies markedly between experiments. In the more focused studies, this will certainly be a result of different criteria for the selection of baits. In the genome-wide screens it may indicate the application of different, experiment-specific cutoffs.
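The two derived columns can be recomputed from the counts as a consistency check. The sketch below uses the five AP-MS rows and assumes that the ratio column is VBP divided by VB, an assumption that agrees with the printed values; the function name `derived_ratios` is hypothetical.

```python
# (reference, VB, VBP, TI) for the five AP-MS rows of the table, in order.
apms_rows = [
    ("Gavin", 455, 271, 3419),
    ("Ho", 493, 231, 3687),
    ("Krogan", 153, 151, 1132),
    ("Gavin", 1752, 991, 19105),
    ("Krogan", 2264, 2226, 63360),
]

def derived_ratios(vb, vbp, ti):
    """Return (VBP/VB, TI/VB), rounded to the precision used in the table."""
    return round(vbp / vb, 2), round(ti / vb, 1)

for name, vb, vbp, ti in apms_rows:
    frac_both, per_bait = derived_ratios(vb, vbp, ti)
    print(f"{name:7s} VB={vb:5d}  VBP/VB={frac_both:.2f}  TI/VB={per_bait:.1f}")
```

For these rows the recomputed values reproduce the tabulated ratios, including the spread of TI/VB from about 7.4 to 28.0 interactions per viable bait across the AP-MS screens.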
Figure 3. Scatterplot of $n_{\mathrm{in}}$ and $n_{\mathrm{out}}$ for the AP-MS data of Krogan et al. [11]. Each point in the plot corresponds to one protein. $n_{\mathrm{in}}$ is the number of times that the protein was found as a prey; $n_{\mathrm{out}}$ is the number of prey it found when used as a bait. The two lines mark contours of probability $p = 10^{-4}$ according to the Binomial model in Equation (3). Outlying proteins (dark blue) show a significantly large difference between $n_{\mathrm{in}}$ and $n_{\mathrm{out}}$, suggesting that at least one of them is wrong. For example, if $n_{\mathrm{out}} \gg n_{\mathrm{in}}$, one possible reason is that a protein is not expressed when used as prey or of such low abundance that it is outcompeted, but when tagged and expressed as a bait, it will identify and pull down its interaction partners as prey. Further validation experiments are needed to determine in each case whether the unreciprocated interactions correspond to false-positive or false-negative observations.
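A test of this kind can be sketched as follows. Equation (3) is not reproduced in this section, so the sketch rests on stated assumptions: that the number of out-edges follows a Binomial distribution with n = n_in + n_out trials and success probability 1/2, and that the contours correspond to a two-sided tail probability. The function names are hypothetical.

```python
from math import exp, lgamma, log

def binom_pmf(k, n, p):
    """Binomial probability mass, computed in log space to avoid overflow."""
    log_coeff = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coeff + k * log(p) + (n - k) * log(1 - p))

def two_sided_p(n_in, n_out, p=0.5):
    """Total probability of all outcomes no more likely than the observed
    n_out, under the ASSUMED model n_out ~ Binomial(n_in + n_out, p)."""
    n = n_in + n_out
    pmf = [binom_pmf(k, n, p) for k in range(n + 1)]
    cutoff = pmf[n_out] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= cutoff))

def is_outlier(n_in, n_out, alpha=1e-4):
    """Flag a protein whose in- and out-degree differ more than expected."""
    return two_sided_p(n_in, n_out) < alpha

# A protein never seen as prey but finding 20 prey as bait is flagged;
# a protein with nearly balanced counts is not.
print(is_outlier(0, 20), is_outlier(10, 12))
```

A flagged protein indicates only that its two counts disagree; as the caption notes, the test cannot say whether the unreciprocated observations are false positives or false negatives.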
Table 2. Pairwise comparison of Y2H datasets
| References | Ito [1] | Cagney [2] | Tong [3] | Hazbun [4] | Zhao [5] | Uetz [6] Experiment 1 | Uetz [6] Experiment 2 |
| [1] | - | 9 | 7 | 24 | 1 | 224 | 47 |
| [2] | 28 | - | 0 | 0 | 0 | 7 | 3 |
| [3] | 34 | 0 | - | 0 | 0 | 4 | 7 |
| [4] | 856 | 14 | 25 | - | 0 | 15 | 12 |
| [5] | 43 | 1 | 2 | 38 | - | 0 | 0 |
| [6] Experiment 1 | 388 | 14 | 22 | 272 | 15 | - | 36 |
| [6] Experiment 2 | 200 | 9 | 26 | 204 | 13 | 108 | - |
The values above the diagonal give the number of viable baits in common between each pair of experiments, and the values below the diagonal give the number of viable prey in common. We see that the overlap between experiments in the sampled fractions of protein-interaction space is in all cases very small, given that thousands of interactions were assayed.
Table 3. Pairwise comparison of AP-MS datasets
| References | Gavin [7] | Ho [8] | Krogan [9] | Gavin [10] | Krogan [11] |
| [7] | - | 82 | 51 | 442 | 334 |
| [8] | 516 | - | 25 | 222 | 286 |
| [9] | 299 | 246 | - | 121 | 151 |
| [10] | 1,143 | 717 | 371 | - | 1,128 |
| [11] | 1,149 | 1,277 | 478 | 1,732 | - |
As in Table 2, the values above the diagonal give the number of viable baits in common between each pair of experiments, and the values below the diagonal give the number of viable prey in common. Again, the overlap is very small. Consider the two largest experiments carried out so far: with a set of 2,264 viable baits and 5,323 viable prey, Krogan et al. [11] tested for the presence of at least 12 million complex membership interactions. Gavin et al. [10], with 1,752 viable baits and 1,790 viable prey, tested for at least 3.1 million interactions. However, even for these two datasets, the known overlap is only 1,128 × 1,732 ≈ 2.0 million bait-prey pairs. One possible explanation for these low estimates of coverage and overlap is that our definitions of viable baits and viable prey are restrictive and that a much larger space of interactions might in fact have been tested. For example, Gilchrist et al. [29] estimated a value about twice ours for the number of tested prey in [7]. This situation will hopefully be alleviated as researchers report more complete data on which interactions were actually tested.