Jie Wang, Wenping Zheng, Yuhua Qian, Jiye Liang.
Abstract
Most proteins perform their biological functions while interacting as complexes. Detecting protein complexes is important not only for understanding the relationship between the functions and structures of a biological network, but also for predicting the functions of unknown proteins. We present a new nodal metric that integrates a node's local topological information. The metric reflects how well a node represents a cluster of a protein-protein interaction (PPI) network within a larger local neighborhood. Based on the metric, we propose a seed-expansion graph clustering algorithm (SEGC) for protein complex detection in PPI networks. A roulette wheel strategy is used to select seeds, which enhances the diversity of the clustering. For a candidate node u, we define its closeness to a cluster C, denoted NC(u, C), by combining the density of C with the connection between u and C. In SEGC, a cluster that initially consists of only a seed node is extended by recursively adding nodes from its neighbors according to their closeness, until all neighbors fail the expansion test. We compare the F-measure and accuracy of SEGC with those of other algorithms on Saccharomyces cerevisiae protein interaction networks. The experimental results show that SEGC outperforms the other algorithms under full coverage.
Keywords: graph clustering; protein complex detection; protein interaction network; seed expansion
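
The seed-expansion procedure summarized in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `closeness` function is a hypothetical stand-in for NC(u, C) (it mixes the density of C ∪ {u} with the fraction of C adjacent to u through an illustrative weight `lam`), and the threshold, the degree-based seed weights, and all function names are assumptions.

```python
import random


def roulette_wheel_select(weights, rng):
    """Pick a key with probability proportional to its non-negative weight."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    threshold = rng.random() * total
    cumulative = 0.0
    for node, weight in weights.items():
        cumulative += weight
        if threshold < cumulative:
            return node
    return node  # guard against floating-point round-off


def density(adj, cluster):
    """Fraction of possible edges that are present inside `cluster`."""
    n = len(cluster)
    if n < 2:
        return 1.0
    edges = sum(len(adj[u] & cluster) for u in cluster) // 2
    return 2.0 * edges / (n * (n - 1))


def closeness(adj, u, cluster, lam=0.5):
    """Hypothetical stand-in for NC(u, C); NOT the paper's Equation (8)."""
    linked = len(adj[u] & cluster) / len(cluster)
    return lam * density(adj, cluster | {u}) + (1.0 - lam) * linked


def expand(adj, seed, threshold=0.6, lam=0.5):
    """Grow a cluster from `seed` until every neighbor fails the threshold."""
    cluster = {seed}
    while True:
        frontier = set().union(*(adj[v] for v in cluster)) - cluster
        scored = [(closeness(adj, u, cluster, lam), u) for u in frontier]
        scored = [item for item in scored if item[0] >= threshold]
        if not scored:
            return cluster
        cluster.add(max(scored)[1])  # add the closest candidate first


# Toy network (illustrative): a triangle a-b-c with a tail c-d-e.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
degree_weights = {u: float(len(adj[u])) for u in adj}  # assumed seed weights
seed = roulette_wheel_select(degree_weights, random.Random(42))
print(seed, sorted(expand(adj, seed)))
print(sorted(expand(adj, "a")))  # ['a', 'b', 'c']
```

Drawing the seed at random rather than always taking the highest-weight node is what lets repeated runs start from different regions of the network.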
Year: 2017 PMID: 29292776 PMCID: PMC6150027 DOI: 10.3390/molecules22122179
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.411
Description of the main symbols used in this paper.
| Symbol | Description |
|---|---|
| | A graph G including a node set |
| | The number of nodes in a graph |
| | The number of edges in a graph |
| | The |
| | The edge in |
| | The distance between node |
| | The node set of a subgraph |
| | The attribute (feature) matrix of nodes in a graph |
| | The weight vector of the node attributes |
| | The maximum number of iterations |
| | The weight matrix of nodes |
| | The weight of a node or an edge |
| | Probability of node |
| | The cluster (subgraph) with node |
| | The closeness between node u and subgraph |
| | The parameter to control two items in |
| | Reduce rate of |
| | Diameter of a graph |
| | The user-defined threshold of |
| | The user-defined threshold of diameter |
Figure 1. An example network. Although node , and have the same degree, they have different representability to a subgraph from Equation (5).
Protein-protein interaction (PPI) datasets.
| Items | Gavin02 | Gavin06 | Krogan_Core | Krogan_Extend | BioGrid |
|---|---|---|---|---|---|
| Proteins | 1352 | 1430 | 2708 | 3672 | 4187 |
| Interactions | 3210 | 6531 | 7123 | 14317 | 20454 |
| Density | 0.0035 | 0.0064 | 0.0019 | 0.0021 | 0.0023 |
| Throughput | High | High | High | High | Low |
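
The Density row is consistent with the usual density of a simple undirected graph, 2m / (n(n - 1)) for n proteins and m interactions. The check below is a short sketch assuming that definition; the function name `graph_density` is illustrative.

```python
def graph_density(num_nodes, num_edges):
    """Density of a simple undirected graph: 2m / (n * (n - 1))."""
    if num_nodes < 2:
        return 0.0
    return 2.0 * num_edges / (num_nodes * (num_nodes - 1))


# (proteins, interactions) copied from the table above.
datasets = {
    "Gavin02": (1352, 3210),
    "Gavin06": (1430, 6531),
    "Krogan_Core": (2708, 7123),
    "Krogan_Extend": (3672, 14317),
    "BioGrid": (4187, 20454),
}
for name, (n, m) in datasets.items():
    print(f"{name}: {graph_density(n, m):.4f}")
```
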
Figure 2. The effect of parameters on the performance of seed-expansion graph clustering (SEGC) on BioGrid: (a) the effect of and ; (b) the effect of and .
Comparison of the original IPCA with three variants: IPCA with the new node weighting method (IPCA-NW), IPCA with the roulette wheel method (IPCA-RW), and IPCA with the NC metric in Equation (8) (IPCA-NC).
| Network | Criteria | IPCA | IPCA-NW | IPCA-RW | IPCA-NC |
|---|---|---|---|---|---|
| Gavin02 | Precision | 0.4675 | 0.4686 | 0.4851 | 0.5462 |
| | Recall | 0.3505 | 0.3505 | 0.3505 | 0.3603 |
| | F-measure | 0.4006 | 0.4010 | 0.4070 | 0.4342 |
| | PPV | 0.5541 | 0.5532 | 0.5522 | 0.5578 |
| | Sn | 0.3646 | 0.3646 | 0.3646 | 0.4141 |
| | Accuracy | 0.4495 | 0.4491 | 0.4487 | 0.4806 |
| Gavin06 | Precision | 0.5289 | 0.5298 | 0.5460 | 0.4603 |
| | Recall | 0.3750 | 0.3750 | 0.3750 | 0.3750 |
| | F-measure | 0.4389 | 0.4392 | 0.4446 | 0.4133 |
| | PPV | 0.5375 | 0.5375 | 0.5447 | 0.5299 |
| | Sn | 0.4807 | 0.4807 | 0.4797 | 0.5021 |
| | Accuracy | 0.5083 | 0.5083 | 0.5112 | 0.5158 |
| Krogan_core | Precision | 0.4732 | 0.4744 | 0.4857 | 0.4769 |
| | Recall | 0.5662 | 0.5637 | 0.5686 | 0.5735 |
| | F-measure | 0.5155 | 0.5152 | 0.5239 | 0.5208 |
| | PPV | 0.6058 | 0.6054 | 0.6037 | 0.6164 |
| | Sn | 0.5786 | 0.5776 | 0.5792 | 0.5891 |
| | Accuracy | 0.5921 | 0.5913 | 0.5913 | 0.6026 |
| Krogan_extend | Precision | 0.4114 | 0.4120 | 0.4185 | 0.4434 |
| | Recall | 0.4926 | 0.4926 | 0.4951 | 0.5466 |
| | F-measure | 0.4484 | 0.4487 | 0.4536 | 0.4896 |
| | PPV | 0.5234 | 0.5250 | 0.5304 | 0.5499 |
| | Sn | 0.5974 | 0.5974 | 0.5979 | 0.6135 |
| | Accuracy | 0.5592 | 0.5600 | 0.5631 | 0.5809 |
| BioGrid | Precision | 0.5075 | 0.5083 | 0.5135 | 0.5316 |
| | Recall | 0.8088 | 0.8088 | 0.8088 | 0.8260 |
| | F-measure | 0.6237 | 0.6243 | 0.6282 | 0.6469 |
| | PPV | 0.4482 | 0.4480 | 0.4485 | 0.4748 |
| | Sn | 0.7885 | 0.7885 | 0.7880 | 0.8115 |
| | Accuracy | 0.5945 | 0.5944 | 0.5945 | 0.6207 |
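
The derived rows of this table can be reproduced from the other rows. The sketch below assumes the conventional definitions, which this excerpt does not state explicitly: the F-measure (the row following Recall in each group) as the harmonic mean of Precision and Recall, and Accuracy as the geometric mean of PPV and Sn. Both assumptions match the tabulated values to four decimal places for the entries checked here.

```python
import math


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (assumed definition)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def accuracy(ppv, sn):
    """Geometric mean of PPV and Sn (assumed definition)."""
    return math.sqrt(ppv * sn)


# IPCA on Gavin02, values copied from the table above.
print(f"{f_measure(0.4675, 0.3505):.4f}")  # 0.4006
print(f"{accuracy(0.5541, 0.3646):.4f}")   # 0.4495
```
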
Parameters of each algorithm.
| Algorithm | Parameter | Value |
|---|---|---|
| CFinder | | 3 |
| DPClus | cluster property value | 0.5 |
| | density | 0.7 |
| IPCA | interaction probability | 0.4 |
| | diameter | 2 |
| SR-MCL | inflation | 2 |
| | balance | 0.5 |
| | iterations | 30 |
| | penalty ratio | 1.25 |
| | quality function | 1.2 |
| | overlap threshold | 0.6 |
| PEWCC | join parameter | 0.5 |
| | overlap threshold | 0.8 |
| DCU | expected density | 0.2 |
| WCOACH | neighborhood affinity threshold | 0.85 |
| WEC | balance factor | 0.8 |
| | edge weight | 0.7 |
| | enrichment | 0.8 |
| | filtering | 0.9 |
The evaluation results by different algorithms on five PPI networks.
| Network | Criteria | SEGC | CFinder | DPClus | IPCA | Core | SR-MCL | PEWCC | DCU | WCOACH | WEC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gavin02 | Precision | 0.5621 | 0.7333 | 0.4679 | 0.4675 | 0.3717 | 0.7818 | 0.5154 | 0.3897 | 0.6311 | 0.7137 |
| | Recall | 0.3603 | 0.1373 | 0.3088 | 0.3505 | 0.3505 | 0.1838 | 0.2034 | 0.2990 | 0.1520 | 0.1667 |
| | F-measure | | 0.2312 | 0.3721 | 0.4006 | 0.3608 | 0.2977 | 0.2917 | 0.3384 | 0.2449 | 0.2702 |
| | PPV | 0.5597 | 0.4150 | 0.6207 | 0.5541 | 0.6153 | 0.5089 | 0.5558 | 0.4184 | 0.3310 | 0.5936 |
| | Sn | 0.4146 | 0.3203 | 0.2755 | 0.3646 | 0.3646 | 0.2833 | 0.2776 | 0.4490 | 0.4188 | 0.2531 |
| | Accuracy | | 0.3646 | 0.4135 | 0.4495 | 0.4736 | 0.3797 | 0.3928 | 0.4334 | 0.3723 | 0.3876 |
| | Coverage | 1352 ( | 623 (46%) | 690 (51%) | 1352 (100%) | 1041 (77%) | 584 (43%) | 599 (44%) | 1350 (100%) | 1034 (76%) | 502 (37%) |
| Gavin06 | Precision | 0.4754 (0.5030) | 0.6633 | 0.5502 | 0.5289 | 0.4869 | 0.7512 | 0.4687 | 0.3295 | 0.4742 | 0.7774 |
| | Recall | 0.3750 (0.4706) | 0.1912 | 0.3873 | 0.3750 | 0.3627 | 0.3088 | 0.3456 | 0.2451 | 0.2328 | 0.2941 |
| | F-measure | 0.4193 ( | 0.2968 | | 0.4389 | 0.4157 | 0.4377 | 0.3978 | 0.2811 | 0.3123 | 0.4268 |
| | PPV | 0.5335 (0.6110) | 0.3425 | 0.6413 | 0.5375 | 0.5833 | 0.5286 | 0.5585 | 0.2959 | 0.3300 | 0.5735 |
| | Sn | 0.5021 (0.4661) | 0.5125 | 0.4307 | 0.4807 | 0.4599 | 0.4849 | 0.4307 | 0.5318 | 0.5500 | 0.4479 |
| | Accuracy | 0.5176 ( | 0.4190 | | 0.5083 | 0.5180 | 0.5063 | 0.4905 | 0.3966 | 0.4261 | 0.5068 |
| | Coverage | 1430 ( | 1124 (79%) | 1056 (74%) | 1430 (100%) | 1144 (80%) | 1135 (79%) | 1081 (76%) | 1413 (99%) | 1335 (93%) | 947 (66%) |
| Krogan_core | Precision | 0.4889 | 0.6174 | 0.3626 | 0.4732 | 0.2960 | 0.7341 | 0.5379 | 0.2272 | 0.5166 | 0.8382 |
| | Recall | 0.5760 | 0.2034 | 0.5931 | 0.5662 | 0.5907 | 0.3309 | 0.3431 | 0.4779 | 0.2549 | 0.2770 |
| | F-measure | | 0.3060 | 0.4501 | 0.5155 | 0.3943 | 0.4562 | 0.4190 | 0.3080 | 0.3414 | 0.4163 |
| | PPV | 0.6222 | 0.3588 | 0.7128 | 0.6058 | 0.6308 | 0.6063 | 0.5550 | 0.3180 | 0.2231 | 0.6603 |
| | Sn | 0.5885 | 0.4802 | 0.4885 | 0.5786 | 0.5109 | 0.4620 | 0.4135 | 0.5964 | 0.5849 | 0.3937 |
| | Accuracy | | 0.4151 | 0.5901 | 0.5921 | 0.5677 | 0.5293 | 0.4791 | 0.4355 | 0.3612 | 0.5099 |
| | Coverage | 2708 ( | 1143 (42%) | 1727 (64%) | 2708 (100%) | 2082 (77%) | 1188 (44%) | 1101 (41%) | 2660 (98%) | 2112 (78%) | 866 (32%) |
| Krogan_extend | Precision | 0.4517 | 0.4545 | 0.3187 | 0.4114 | 0.2036 | 0.7627 | 0.4259 | 0.1450 | 0.2381 | 0.7901 |
| | Recall | 0.5466 | 0.1495 | 0.5711 | 0.4926 | 0.5833 | 0.2794 | 0.4044 | 0.4265 | 0.1789 | 0.2157 |
| | F-measure | | 0.2250 | 0.4091 | 0.4484 | 0.3019 | 0.4090 | 0.4149 | 0.2164 | 0.2043 | 0.3389 |
| | PPV | 0.5564 | 0.2223 | 0.6738 | 0.5234 | 0.6326 | 0.5977 | 0.5179 | 0.2931 | 0.1028 | 0.5935 |
| | Sn | 0.6130 | 0.5625 | 0.5005 | 0.5974 | 0.5125 | 0.4495 | 0.4865 | 0.6271 | 0.6833 | 0.3786 |
| | Accuracy | | 0.3536 | 0.5807 | 0.5592 | 0.5694 | 0.5183 | 0.5019 | 0.4288 | 0.2650 | 0.4740 |
| | Coverage | 3672 ( | 1596 (43%) | 1948 (53%) | 3672 (100%) | 2669 (73%) | 1282 (35%) | 1567 (43%) | 3668 (100%) | 3309 (90%) | 905 (25%) |
| BioGrid | Precision | 0.5377 | 0.4225 | 0.3736 | 0.5075 | 0.2467 | 0.5872 | 0.4923 | 0.1530 | 0.1640 | 0.6600 |
| | Recall | 0.8284 | 0.1520 | 0.7402 | 0.8088 | 0.6667 | 0.5098 | 0.7721 | 0.3113 | 0.2598 | 0.4706 |
| | F-measure | | 0.2235 | 0.4965 | 0.6237 | 0.3602 | 0.5458 | 0.6012 | 0.2051 | 0.2011 | 0.5494 |
| | PPV | 0.4741 | 0.1616 | 0.6031 | 0.4482 | 0.5231 | 0.5019 | 0.5002 | 0.2086 | 0.1530 | 0.4685 |
| | Sn | 0.8104 | 0.8755 | 0.6776 | 0.7885 | 0.7453 | 0.7479 | 0.7344 | 0.8875 | 0.9370 | 0.6922 |
| | Accuracy | 0.6199 | 0.3762 | | 0.5945 | 0.6244 | 0.6127 | 0.6061 | 0.4303 | 0.3786 | 0.5695 |
| | Coverage | 4187 ( | 2740 (65%) | 2599 (62%) | 4187 (100%) | 3243 (80%) | 2764 (66%) | 2632 (63%) | 4168 (99%) | 3904 (93%) | 2011 (48%) |
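
The Coverage rows pair a protein count with a percentage of the network. A minimal sketch of that bookkeeping follows, assuming that coverage counts the proteins assigned to at least one predicted cluster (this excerpt does not define it); the function name and the toy clusters are illustrative.

```python
def coverage(clusters, num_proteins):
    """Return (covered count, covered fraction) for possibly overlapping clusters."""
    covered = set().union(*clusters)  # a protein in several clusters counts once
    return len(covered), len(covered) / num_proteins


# Toy example: two overlapping clusters in an 8-protein network.
count, fraction = coverage([{"p1", "p2", "p3"}, {"p3", "p4"}], 8)
print(count, f"{fraction:.0%}")  # 4 50%

# Percentage reported for CFinder on Gavin02 (623 of 1352 proteins).
print(round(100 * 623 / 1352))  # 46
```
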
Performance of seed-expansion graph clustering (SEGC) on data sets.
| Criteria | Gavin02 | Gavin06 | Krogan_Core | Krogan_Extend | BioGrid |
|---|---|---|---|---|---|
| Precision | 0.5520 ± 1.1347 × 10⁻⁵ | 0.4634 ± 1.5535 × 10⁻⁵ | 0.4812 ± 7.0585 × 10⁻⁶ | 0.4465 ± 3.9754 × 10⁻⁶ | 0.5317 ± 4.8829 × 10⁻⁶ |
| Recall | 0.3603 ± 7.7192 × 10⁻³⁰ | 0.3708 ± 9.4804 × 10⁻⁶ | 0.5727 ± 3.1782 × 10⁻⁶ | 0.5425 ± 5.7244 × 10⁻⁶ | 0.8257 ± 2.9649 × 10⁻⁶ |
| F-measure | 0.4360 ± 1.1076 × 10⁻⁶ | 0.4120 ± 7.4338 × 10⁻⁶ | 0.5230 ± 2.9506 × 10⁻⁶ | 0.4898 ± 2.6329 × 10⁻⁶ | 0.6468 ± 3.0660 × 10⁻⁶ |
| PPV | 0.5564 ± 7.5655 × 10⁻⁶ | 0.5327 ± 8.7280 × 10⁻⁶ | 0.6227 ± 1.0720 × 10⁻⁵ | 0.5548 ± 2.4239 × 10⁻⁶ | 0.4752 ± 2.4318 × 10⁻⁶ |
| Sn | 0.4147 ± 1.9255 × 10⁻⁷ | 0.5012 ± 7.4491 × 10⁻⁷ | 0.5882 ± 3.4486 × 10⁻⁷ | 0.6121 ± 5.7970 × 10⁻⁷ | 0.8111 ± 7.6118 × 10⁻⁷ |
| Accuracy | 0.4803 ± 1.7024 × 10⁻⁶ | 0.5167 ± 1.9278 × 10⁻⁶ | 0.6052 ± 2.5290 × 10⁻⁶ | 0.5828 ± 8.2139 × 10⁻⁷ | 0.6209 ± 1.1542 × 10⁻⁶ |
Figure 3. Examples of predicted complexes matching standard complexes: (a) NuA4 histone acetyltransferase complex predicted by SEGC on BioGrid; (b) Arp2/3 protein complex predicted by SEGC on Gavin02; (c) transport protein particle (TRAPP) complex predicted by SEGC on Gavin06; (d) transcription factor TFIIIC complex predicted by SEGC on Krogan_extend; (e) carboxy-terminal domain protein kinase complex predicted by SEGC on Gavin06.
Figure 4. Examples of predicted complexes in which none of the proteins is labeled by any standard complex: (a) a complex predicted by SEGC on BioGrid; (b) another complex predicted by SEGC on BioGrid.
Examples of predicted complexes by SEGC.
| ID | Predicted Complexes | NA | Biological Processes: GO Term | p-value | Molecular Functions: GO Term | p-value | Cellular Components: GO Term | p-value |
|---|---|---|---|---|---|---|---|---|
| 1 | YLR370C YIL062C YKL013C YNR035C YJR065C YDL029W YBR234C | 1 | actin cytoskeleton organization (GO:0030036) | 1.59 × 10⁻¹¹ | adenyl ribonucleotide binding (GO:0032559) | 0.00469 | Arp2/3 protein complex (GO:0005885) | 9.17 × 10⁻²² |
| 2 | YBR254C YKR068C YDR472W YDR108W YOR115C YGR166W YDR407C YMR218C YML077W YDR246W | 1 | Golgi vesicle transport (GO:0048193) | 7.60 × 10⁻¹⁵ | Rab guanyl-nucleotide exchange factor activity (GO:0017112) | 9.00 × 10⁻²⁰ | TRAPP complex (GO:0030008) | 4.05 × 10⁻³⁰ |
| 3 | YPL007C YBR123C YOR110W YAL001C YGR047C YDR362C | 1 | transcription from RNA polymerase III type 2 promoter (GO:0001009) | 1.91 × 10⁻¹⁹ | RNA polymerase III type 2 promoter sequence-specific DNA binding (GO:0001003) | 1.75 × 10⁻¹⁹ | transcription factor TFIIIC complex (GO:0000127) | 1.01 × 10⁻¹⁹ |
| 4 | YJR082C YFL024C YOR244W YNL107W YJL081C YOL012C YFL039C YGR002C YHR090C YHR099W YNL136W YDR359C YEL018W YPR023C YDR485C | 0.87 | histone acetylation (GO:0016573) | 2.67 × 10⁻¹⁷ | histone acetyltransferase activity (GO:0004402) | 2.06 × 10⁻¹³ | NuA4 histone acetyltransferase complex (GO:0035267) | 1.26 × 10⁻³⁴ |
| 5 | YKL139W YJL006C YML112W YAL005C YBR169C | 0.6 | positive regulation of translational fidelity (GO:0045903) | 3.22 × 10⁻⁷ | - | - | carboxy-terminal domain protein kinase complex (GO:0032806) | 2.37 × 10⁻⁶ |
| 6 | YAL058W YFR042W YPR159W YOR336W YGR143W | - | beta-glucan biosynthetic process (GO:0051274) | 3.23 × 10⁻⁹ | glucosidase activity (GO:0015926) | 0.00067 | integral component of endoplasmic reticulum membrane (GO:0030176) | 0.00011 |
| 7 | YNL263C YGR172C YGL198W YKR014C YML001W YOR089C YNL093W YLR262C YER136W YBR264C YNL044W YER031C YFL038C | - | vesicle-mediated transport (GO:0016192) | 1.67 × 10⁻¹¹ | GTPase activity (GO:0003924) | 2.24 × 10⁻¹³ | cytoplasmic vesicle (GO:0031410) | 8.39 × 10⁻⁸ |
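
The NA column is not expanded in this excerpt. If it denotes the neighborhood affinity score commonly used to match a predicted complex A against a standard complex B, NA(A, B) = |A ∩ B|² / (|A| · |B|), it can be computed as sketched below. That reading is an assumption, and the three-protein reference set is a hypothetical illustration rather than data from the source; with it, the five-protein prediction in row 5 scores 0.6, which agrees with the tabulated value.

```python
def neighborhood_affinity(predicted, reference):
    """NA(A, B) = |A ∩ B|^2 / (|A| * |B|), assumed meaning of the NA column."""
    if not predicted or not reference:
        return 0.0
    overlap = len(predicted & reference)
    return overlap ** 2 / (len(predicted) * len(reference))


# Predicted complex 5 from the table above.
predicted = {"YKL139W", "YJL006C", "YML112W", "YAL005C", "YBR169C"}
# Hypothetical reference complex (illustrative assumption, not from the source).
reference = {"YKL139W", "YJL006C", "YML112W"}

print(f"{neighborhood_affinity(predicted, reference):.2f}")  # 0.60
```

A score of 1 then means the predicted and standard complexes contain exactly the same proteins, as in rows 1 to 3.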