Ruth Dunn, Frank Dudbridge, Christopher M Sanderson.
Abstract
BACKGROUND: This paper describes an automated method for finding clusters of interconnected proteins in protein interaction networks and retrieving protein annotations associated with these clusters.
Year: 2005 PMID: 15740614 PMCID: PMC555937 DOI: 10.1186/1471-2105-6-39
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Datasets used for analysis. Numbers of nodes and edges in each dataset, with a brief description of the methods used to generate it.
| Name | Nodes | Edges | Description and Reference |
| Gavin | 1343 | 3145 | Mass screen of yeast protein complexes using affinity purification [15] (Note 1) |
| Ito | 3271 | 4469 | Mass screen of yeast protein interactions using Y2H [18] (Note 1) |
| Lehner | 329 | 406 | Y2H interactions between |
| Uetz | 1358 | 1498 | Mass screen of yeast protein interactions using Y2H [14] (Note 1) |
Notes: 1. The Gavin, Ito and Uetz graphs were all generated from BIND [28] derived datasets, which had GO annotations added and were supplied with v0.9.1 of the 'Osprey' graph visualisation tool [29,30]
2. The Lehner dataset is a combined set of the data from the two cited papers. These data are available in both IntAct [31] (experiment references EBI-348647, EBI-368082 and EBI-368083) and BIND [28] (refs 130691–130793 and 153087–153089)
Range of cluster sizes. The distribution of cluster sizes in 3 datasets, after clustering with different numbers of edges removed. Each size column gives the number of clusters whose node count falls in that range.
| Dataset | Number of Edges Removed | Edges Removed % | Clusters with 1 node | Clusters with 2–5 nodes | Clusters with 6–20 nodes | Clusters with 21–50 nodes | Clusters with 51–200 nodes | Clusters with 201+ nodes |
| Uetz | 30 | 2% | 13 | 128 | 9 | 3 | 0 | 1 |
| Uetz | 57 | 4% | 13 | 128 | 9 | 3 | 1 | 1 |
| Uetz | 100 | 7% | 13 | 128 | 11 | 5 | 4 | 1 |
| Uetz | 200 | 13% | 13 | 130 | 32 | 19 | 1 | 0 |
| Uetz | 400 | 27% | 21 | 256 | 71 | 0 | 0 | 0 |
| Gavin | 57 | 1.5% | 0 | 33 | 8 | 2 | 0 | 1 |
| Gavin | 400 | 15% | 0 | 33 | 16 | 4 | 3 | 2 |
| Gavin | 800 | 25% | 4 | 58 | 57 | 15 | 2 | 0 |
| Gavin | 1500 | 50% | 263 | 154 | 67 | 1 | 0 | 0 |
| Lehner | 15 | 4% | 1 | 6 | 5 | 2 | 1 | 0 |
| Lehner | 30 | 7% | 1 | 6 | 7 | 3 | 1 | 0 |
| Lehner | 57 | 14% | 1 | 6 | 10 | 4 | 1 | 0 |
| Lehner | 100 | 25% | 4 | 15 | 23 | 0 | 0 | 0 |
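The edge-removal step behind this table can be sketched as follows. This excerpt does not say how the removed edges are chosen, so the sketch assumes that the edge with the highest edge betweenness is removed and that scores are recomputed after every removal (a Girvan-Newman style procedure). The function names `edge_betweenness`, `remove_edges`, `components` and `size_histogram` are illustrative only, and the toy network is invented for the example.

```python
from collections import defaultdict, deque

# Size ranges used as column headings in the table; None means "no upper limit".
SIZE_BINS = [(1, 1), (2, 5), (6, 20), (21, 50), (51, 200), (201, None)]


def edge_betweenness(adj):
    """Brandes-style edge betweenness for an unweighted, undirected graph.

    Each edge is scored once per direction, so scores are twice the usual
    value; the ranking of edges is unaffected.
    """
    score = defaultdict(float)
    for s in adj:
        sigma = dict.fromkeys(adj, 0.0)  # number of shortest paths from s
        sigma[s] = 1.0
        dist = {s: 0}
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:  # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):  # accumulate dependencies, farthest first
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += c
                delta[v] += c
    return score


def remove_edges(edges, n_remove):
    """Remove n_remove edges, highest betweenness first (ASSUMED criterion)."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    adj = dict(adj)
    for _ in range(n_remove):
        score = edge_betweenness(adj)
        if not score:
            break  # no edges left to remove
        # Break ties deterministically by the sorted endpoint names.
        a, b = max(score, key=lambda e: (score[e], sorted(e)))
        adj[a].discard(b)
        adj[b].discard(a)
    return adj


def components(adj):
    """Connected components (the clusters) of an adjacency mapping."""
    seen, result = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, members = deque([start]), []
        while queue:
            v = queue.popleft()
            members.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        result.append(members)
    return result


def size_histogram(sizes):
    """Count clusters per size range, in the column order of the table."""
    counts = [0] * len(SIZE_BINS)
    for size in sizes:
        for i, (lo, hi) in enumerate(SIZE_BINS):
            if size >= lo and (hi is None or size <= hi):
                counts[i] += 1
                break
    return counts


# Toy network (hypothetical): two triangles joined by one bridge edge, c-d.
toy = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"),
       ("d", "e"), ("e", "f"), ("d", "f")]
clusters = components(remove_edges(toy, 1))
print(size_histogram(len(c) for c in clusters))  # [0, 2, 0, 0, 0, 0]
```

Removing the single bridge edge splits the toy network into two clusters of three nodes, which both fall in the 2–5 column. Recomputing betweenness after each removal is costly on networks the size of Ito, so a real analysis would likely rely on an optimised graph library.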
Cluster characteristics. The average cluster size, number of clusters and other properties of each dataset, after clustering with different numbers of edges removed.
| Dataset | Number of Edges Removed | Edges Removed % | Number of clusters size > 1 | Average Cluster Size | Biggest cluster(%) | Single Nodes(%) |
| Uetz | 30 | 2% | 141 | 9.5 | 849(61%) | 13(1 %) |
| Uetz | 57 | 4% | 142 | 9.5 | 715(53%) | 13(1 %) |
| Uetz | 100 | 7% | 149 | 9.0 | 459(38%) | 13(1 %) |
| Uetz | 200 | 13% | 182 | 7.4 | 53(4 %) | 13(1 %) |
| Uetz | 400 | 27% | 327 | 4.1 | 13(1 %) | 21(1.5%) |
| Gavin | 57 | 1.5% | 44 | 30.5 | 1106(82%) | 0(0 %) |
| Gavin | 400 | 15% | 58 | 23.1 | 360(27%) | 0(0 %) |
| Gavin | 800 | 25% | 132 | 10.1 | 56(4 %) | 4(0.3%) |
| Gavin | 1500 | 50% | 222 | 4.9 | 23(2 %) | 263(19 %) |
| Lehner | 15 | 4% | 14 | 23.4 | 190(58%) | 1(0.3%) |
| Lehner | 30 | 7% | 17 | 19.3 | 143(43%) | 1(0.3%) |
| Lehner | 57 | 14% | 21 | 15.6 | 60(18%) | 1(0.3%) |
| Lehner | 100 | 25% | 38 | 8.6 | 19(6 %) | 2(0.6%) |
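The columns of this table can be derived from a plain list of cluster sizes, as the sketch below illustrates. The tabulated averages appear to exclude single nodes: for Uetz with 30 edges removed, (1358 − 13) / 141 ≈ 9.5, which matches the table. That reading is inferred from the numbers rather than stated in this excerpt, and the name `summarize_clusters` and its dictionary keys are illustrative.

```python
def summarize_clusters(sizes):
    """Summary statistics for a list of cluster sizes (each size >= 1).

    ASSUMPTION: the average cluster size is taken over clusters with more
    than one node, which is consistent with the tabulated values.
    """
    total_nodes = sum(sizes)
    multi = [s for s in sizes if s > 1]
    singles = len(sizes) - len(multi)
    biggest = max(sizes)
    return {
        "clusters_gt1": len(multi),
        "average_size": sum(multi) / len(multi) if multi else 0.0,
        "biggest": biggest,
        "biggest_pct": 100.0 * biggest / total_nodes,
        "single_nodes": singles,
        "single_pct": 100.0 * singles / total_nodes,
    }


# Hypothetical cluster sizes, not taken from any dataset in the paper.
print(summarize_clusters([1, 1, 4, 5, 9]))

# Arithmetic check against the Uetz row with 30 edges removed.
print(round((1358 - 13) / 141, 1))  # 9.5
```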
Cluster quality. Association between the size of the clusters and the quality and quantity of significant GO terms with different numbers of edges removed.
| Dataset | Number of Edges Removed | Edges Removed % | GO per Cluster | GO per Node | Depth of GO per Node | Number of Clusters with no significant annotation |
| Uetz | 30 | 2% | 0.7 | 0.1 | 4.9 | 120 |
| Uetz | 57 | 4% | 0.8 | 0.1 | 4.9 | 120 |
| Uetz | 100 | 7% | 0.9 | 0.1 | 4.9 | 121 |
| Uetz | 200 | 13% | 1.1 | 0.2 | 4.8 | 137 |
| Uetz | 400 | 27% | 3.5 | 0.7 | 4.5 | 261 |
| Gavin | 57 | 1.5% | 2.1 | 0.1 | 4.6 | 23 |
| Gavin | 400 | 15% | 3.4 | 0.2 | 4.7 | 24 |
| Gavin | 800 | 25% | 2.8 | 0.3 | 4.6 | 59 |
| Gavin | 1500 | 50% | 2.2 | 0.5 | 4.6 | 336 |
| Lehner | 15 | 4% | 22.9 | 1.0 | 5.8 | 1 |
| Lehner | 30 | 7% | 21.3 | 1.1 | 5.8 | 1 |
| Lehner | 57 | 14% | 19.2 | 1.2 | 5.8 | 1 |
| Lehner | 100 | 25% | 15.5 | 1.8 | 5.8 | 2 |
Significant GO terms for the Lehner dataset. A selection of GO terms with significant correlations to the 20 clusters in the Lehner dataset, clustered by removing 57 edges. The numbers after each description give the proportion of proteins in the cluster that were annotated with that GO term. The complete set of GO terms for each of these clusters can be seen in Additional file 7, and the identity of the transcripts associated with the significant GO terms can be found in Additional file 8.
| Cluster Number | Size of Cluster | Significant GO descriptions |
| 15 | 20 | ubiquitination 4/20 |
| 4 | 49 | protein biosynthesis 7/49, RNA catabolism 4/49, translation 3/49 |
| 19 | 3 | ubiquitin 1/3, cell defence 1/3 |
| 8 | 24 | electron transport 2/24 |
| 11 | 10 | transcription regulation 3/10 |
| 16 | 22 | transport 6/22, glucose catabolism 2/22 |
| 18 | 7 | DNA repair 2/7 |
| 3 | 60 | RNA splicing 14/60, spliceosome 5/60 |
| 7 | 10 | ribosome assembly 2/10, cytoplasmic exosome 1/10 |
| 12 | 8 | protein metabolism 3/8, phosphorylation 3/8 |
| 22 | 1 | morphogenesis 1/2, membrane 1/2 |
| 2 | 19 | signal transduction 4/19, ER 3/19 |
| 9 | 4 | transcription reg 1/4 |
| 21 | 4 | mRNA catabolism 1/4 |
| 6 | 12 | mRNA export 1/12, DNA binding 4/12 |
| 1 | 18 | cytoskeleton 3/18 |
| 20 | 2 | ATP biosynthesis 2/2 |
| 14 | 14 | DNA replication 2/14, cell cycle 3/14 |
| 10 | 23 | biological process 8/23, oncogenesis 2/23 |
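Counts such as 14/60 are the usual input to a term-enrichment test. This excerpt does not state which significance test was used, so the sketch below shows one common choice, a one-sided hypergeometric test, purely as an assumption. The function name `enrichment_p_value` is illustrative, and the background count of 18 annotated proteins is a placeholder, not a figure reported for the whole dataset.

```python
from math import comb


def enrichment_p_value(k, cluster_size, annotated_total, population):
    """P(X >= k) when cluster_size proteins are drawn without replacement
    from a population in which annotated_total carry the GO term.

    ASSUMPTION: a hypergeometric tail test; the source excerpt does not
    specify its test.
    """
    upper = min(cluster_size, annotated_total)
    tail = sum(
        comb(annotated_total, i)
        * comb(population - annotated_total, cluster_size - i)
        for i in range(k, upper + 1)
    )
    return tail / comb(population, cluster_size)


# 14 of 60 cluster proteins annotated, 329 proteins in the Lehner dataset.
# The value 18 for dataset-wide annotations is a HYPOTHETICAL placeholder.
p = enrichment_p_value(k=14, cluster_size=60, annotated_total=18, population=329)
print(f"p = {p:.3g}")
```

In practice a correction for testing many GO terms across many clusters would also be needed; that step is left out of the sketch.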
Clustering of RNA splicing proteins in the Lehner dataset with different numbers of edges removed.
| edges removed | size of 'RNA splicing' cluster | proportion of proteins annotated for 'RNA splicing' | proportion of proteins which were prey of Lsm proteins in [13] |
| 15 | 190 | 18/190 | 51/190 |
| 30 | 143 | 17/143 | 49/143 |
| 57 | 60 | 14/60 | 49/60 |
| 100 | 17 | 10/17 | 14/17 |
The distribution of proteins associated with RNA metabolism from TAP-C128. The number of proteins from TAP-C128 [15] that cluster together when different numbers of edges are removed, and the proportions that are annotated for RNA metabolism.
| Edges removed | Number of clusters associated with the GO term 'RNA metabolism' | Largest group of TAP-C128 found together | Proportion of proteins * | Number of Lsm proteins in this cluster |
| 57 | 3 clusters | 36/36 | 128/1106 | 7/7 |
| 400 | 7 clusters | 27/36 | 57/142 | 7/7 |
| 800 | 8 clusters | 13/36 | 43/56 | 7/7 |
| 1500 | 11 clusters | 7/36 | 5/7 | 6/7 |
*Proportion of proteins in the cluster containing most TAP-C128 proteins which were associated with RNA metabolism
The distribution of affinity purified proteins from TAP C-162. TAP C-162 [15] is an mRNA polyadenylation complex of 36 proteins, thought to be a stable complex.
| Edges Removed | Number of Clusters Containing TAP C-162 proteins | Numbers of the TAP C-162 proteins in each of the Clusters |
| 57 | 1 | (36) |
| 400 | 5 | (25,7, 4 × 1) |
| 800 | 9 | (23,4, 9 × 1) |
| 1500 | 16 | (16,22, 16 × 1) |
The distribution of affinity purified proteins from TAP C-151. TAP C-151 [15] is a signaling protein complex of 45 proteins, thought to be more labile than TAP C-162.
| Edges Removed | Number of Clusters Containing TAP C-151 proteins | Numbers of the TAP C-151 proteins in each of the Clusters |
| 57 | 1 | (45) |
| 400 | 3 | (43,1,1) |
| 800 | 11 | (14,12,7,3,2,2, 5 × 1) |
| 1500 | 21 | (9,7,6,4,3,2, 14 × 1) |
Datasets used to investigate false positives. These datasets were used to investigate the effect of false positive edges on clustering.
| Name | Nodes | Edges |
| Lehner original | 329 | 406 |
| Lehner plus False Positive proteins and edges | 353 | 465 |
| Lehner with false positive proteins' edges disconnected | 353 | 397 |
| Ito [18] | 3271 | 4469 |
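Cluster size distribution with and without false positives. Each size column gives the number of clusters whose node count falls in that range.
| Dataset | Number of Edges Removed | Edges Removed % | Clusters with 1 node | Clusters with 2–5 nodes | Clusters with 6–20 nodes | Clusters with 21–50 nodes | Clusters with 51–200 nodes | Clusters with 201+ nodes |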
| Lehner | 57 | 14% | 1 | 6 | 10 | 4 | 1 | 0 |
| Lehner plus False Positive Edges(FPE) | 57 | 12.3% | 1 | 6 | 11 | 2 | 2 | 0 |
| Lehner minus 68 FPE | 57+68* | 26.9% | 32 | 6 | 9 | 5 | 1 | 0 |
| Lehner random edges removed** | 57+68* | 26.9% | 39.7 ± 3.2 | 6.5 ± 1.0 | 13.4 ± 2.1 | 4.1 ± 1.1 | 0.05 ± 0.2 | 0 ± 0 |
| Ito minus 26 FPE | 57+26* | 1.9% | 25 | 183 | 4 | 0 | 0 | 1 |
| Ito minus 26 random edges | 57+26* | 1.9% | 11 | 189 | 5 | 0 | 0 | 1 |
* edges removed for clustering + false positive or random edges removed
**Lehner plus FPE with 68 edges removed at random (for 100 replicates mean ± standard deviation)
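The random-removal control in the second footnote (100 replicates, reported as mean ± standard deviation) can be sketched as below. The sketch covers only the random deletion and the summary of one statistic, the number of single nodes; it does not include the subsequent clustering step that removes 57 further edges. The names `count_single_nodes` and `random_removal_replicates`, the fixed seed and the toy edge list are all assumptions made for illustration.

```python
import random
import statistics
from collections import defaultdict


def count_single_nodes(nodes, edges):
    """Number of nodes left with no edge to any other node."""
    degree = defaultdict(int)
    for a, b in edges:
        if a != b:  # ignore self-interactions
            degree[a] += 1
            degree[b] += 1
    return sum(1 for n in nodes if degree[n] == 0)


def random_removal_replicates(edges, n_remove, replicates=100, seed=0):
    """Mean and sample standard deviation of the single-node count after
    deleting n_remove randomly chosen edges, over several replicates.

    Requires replicates >= 2 so that a standard deviation is defined.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    edges = list(edges)
    nodes = {n for edge in edges for n in edge}
    counts = []
    for _ in range(replicates):
        removed = set(rng.sample(range(len(edges)), n_remove))
        kept = [e for i, e in enumerate(edges) if i not in removed]
        counts.append(count_single_nodes(nodes, kept))
    return statistics.mean(counts), statistics.stdev(counts)


# Hypothetical toy network: a ring of six proteins.
ring = [("p1", "p2"), ("p2", "p3"), ("p3", "p4"),
        ("p4", "p5"), ("p5", "p6"), ("p6", "p1")]
mean, sd = random_removal_replicates(ring, n_remove=3)
print(f"single nodes: {mean:.1f} ± {sd:.1f}")
```

Comparing such a random baseline with the targeted removal of suspected false positive edges is what lets the two Lehner rows marked `57+68*` be read side by side.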
Cluster characteristics with and without false positives
| Dataset | Number of Edges Removed | Edges Removed % | Number of clusters size > 1 | Average Cluster Size | Biggest cluster(%) | Single Nodes(%) |
| Lehner | 57 | 14% | 21 | 15.6 | 60(18 %) | 1(0.3 %) |
| Lehner plus False Positive Edges (FPE) | 57 | 12.3% | 21 | 16.8 | 67(19.0%) | 1(0.3 %) |
| Lehner (FPE edges removed) | 57+68* | 26.9% | 21 | 15.3 | 56(15.9%) | 32(9.0 %) |
| Lehner random edges removed** | 57+68* | 26.9% | 24.0 ± 1.5 | 13.1 ± 0.8 | 39.2 ± 6.1(11.1%) | 39.7 ± 3.2(11.2%) |
| Ito minus FPE | 57+26* | 1.9% | 188 | 17.3 | 2798(85.5%) | 25(0.8 %) |
| Ito minus 26 random edges | 57+26* | 1.9% | 195 | 16.7 | 2787(85.2%) | 11(0.3 %) |
* edges removed for clustering + false positive or random edges removed
**Lehner plus FPE with 68 edges removed at random (for 100 replicates mean ± standard deviation)
Cluster quality with and without false positives
| Dataset | Number of Edges Removed | Edges Removed % | GO per Cluster | GO per Node | Depth of GO per Node | Number of Clusters with no significant annotation |
| Lehner | 57 | 14% | 19.2 | 1.2 | 5.8 | 1 |
| Lehner plus False Positive Edges (FPE) | 57 | 12.3% | 18.2 | 1.1 | 5.7 | 1 |
| Lehner (FPE edges removed) | 57+68* | 26.9% | 19.33 | 1.3 | 5.7 | 10 |
| Lehner random edges removed** | 57+68* | 26.9% | 17.5 ± 4.6 | 1.3 ± 0.4 | 5.7 ± 0.08 | 4.2 ± 5.2 |
| Ito minus FPE | 57+26* | 1.9% | 1.3 | 0.1 | 4.7 | 149 |
| Ito minus 26 random edges | 57+26* | 1.9% | 1.3 | 0.1 | 4.7 | 146 |
* edges removed for clustering + false positive or random edges removed
**Lehner plus FPE with 68 edges removed at random (for 100 replicates mean ± standard deviation)