Bo Xu1,2, Kun Li3, Wei Zheng4,5, Xiaoxia Liu4, Yijia Zhang4, Zhehuan Zhao3,6, Zengyou He3,6.
Abstract
BACKGROUND: Identifying protein complexes from protein-protein interaction (PPI) networks is one of the most important tasks in proteomics. Existing computational methods try to incorporate a variety of biological evidence to enhance the quality of predicted complexes. However, it is still a challenge to integrate different types of biological information into the complex discovery process under a unified framework. Recently, attributed network embedding methods have proved remarkably effective in generating vector representations for nodes in a network. In the transformed vector space, both the topological proximity and the node attribute affinity between different nodes are preserved. Therefore, such attributed network embedding methods provide a unified framework for integrating various biological evidence into the protein complex identification process.
Keywords: Network embedding; Protein complexes identification; Protein-protein interaction network
Year: 2018 PMID: 30572820 PMCID: PMC6302388 DOI: 10.1186/s12859-018-2555-x
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 The basic idea of GANE to predict protein complexes from protein-protein interaction networks. The GANE method for protein complex prediction is a two-step procedure. First, it learns the vector representation for each protein from the GO attributed PPI network. Based on the pair-wise vector representation similarity, a weighted adjacency matrix is constructed. Second, it uses a clique mining method to generate candidate cores. A set of seed cores is generated from the set of candidate cores with density-based clique ranking and redundancy-based clique updating. For each seed core, its attachments are those proteins with correlation scores that are larger than a threshold. The seed cores with attachments are the predicted protein complexes.
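The first step in the caption, turning learned protein vectors into edge weights, can be illustrated with a minimal Python sketch. The caption does not name the similarity measure, so cosine similarity is an assumption here, as is the choice to weight only edges already present in the PPI network. The function names and the toy embedding values are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def weight_edges(edges, embedding):
    """Map each undirected edge (stored as a frozenset) to its endpoint similarity.

    Assumption: weights are assigned only to existing PPI edges.
    """
    return {frozenset((p, q)): cosine(embedding[p], embedding[q]) for p, q in edges}


# Toy embeddings (hypothetical values, not learned from real data).
embedding = {
    "A": [1.0, 0.0],
    "B": [1.0, 0.0],
    "C": [0.0, 1.0],
}
edges = [("A", "B"), ("B", "C")]
weights = weight_edges(edges, embedding)
```

In this toy input, proteins A and B have identical vectors and receive the maximum weight, while B and C are orthogonal and receive a weight of zero.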
Major steps of GANE
| Algorithm 1 Protein complex identification algorithm GANE |
|---|
| Input: Graph |
| Output: A set of discovered protein complexes |
| Description: |
| Constructing a protein attribute affinity matrix |
| Generating vector representation for each protein |
| Constructing a weighted adjacency matrix |
| Initializing |
| Generating maximal cliques and put them into |
| While |
| DescendSort( |
| Pruning and updating remaining cliques in |
| End while |
| For core |
| finding the set of its attachments |
| End for |
| Return |
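The core-attachment stage listed above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: density is taken as the mean edge weight inside a clique, pruning simply removes the proteins of each selected core from the remaining cliques, and the correlation score of an outside protein is its summed edge weight to the core divided by the core size. The symbols lost from the listing are not reconstructed; all function names and the toy weights are hypothetical, and `theta` stands for the attachment threshold θ.

```python
from itertools import combinations


def maximal_cliques(adj):
    """Enumerate maximal cliques with basic Bron-Kerbosch (no pivoting)."""
    found = []

    def expand(r, p, x):
        if not p and not x:
            found.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return found


def density(clique, w):
    """Assumed density: mean weight over all protein pairs in the clique."""
    pairs = list(combinations(sorted(clique), 2))
    return sum(w[frozenset(e)] for e in pairs) / len(pairs)


def seed_cores(cliques, w, min_size=3):
    """Repeatedly take the densest clique, then prune its proteins from the rest."""
    remaining = [set(c) for c in cliques if len(c) >= min_size]
    cores = []
    while remaining:
        remaining.sort(key=lambda c: density(c, w), reverse=True)
        best = remaining.pop(0)
        cores.append(best)
        remaining = [c - best for c in remaining]
        remaining = [c for c in remaining if len(c) >= min_size]
    return cores


def attachments(core, adj, w, theta):
    """Proteins outside the core whose assumed correlation score exceeds theta."""
    attached = set()
    for p in set(adj) - core:
        score = sum(w[frozenset((p, q))] for q in adj[p] & core) / len(core)
        if score > theta:
            attached.add(p)
    return attached


# Toy weighted network (hypothetical weights).
weights = {frozenset(e): s for e, s in [
    (("A", "B"), 0.9), (("A", "C"), 0.9), (("B", "C"), 0.9),
    (("A", "D"), 0.6), (("B", "D"), 0.6), (("C", "E"), 0.2),
]}
adj = {}
for edge in weights:
    p, q = tuple(edge)
    adj.setdefault(p, set()).add(q)
    adj.setdefault(q, set()).add(p)

cores = seed_cores(maximal_cliques(adj), weights)
complexes = [core | attachments(core, adj, weights, theta=0.3) for core in cores]
```

On this toy network the densest clique {A, B, C} becomes the only seed core. Protein D is attached because it is strongly connected to two of the three core members, while E is left out.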
The PPI data sets used in the experiment
| PPI networks | Number of proteins | Number of interactions | Average clustering coefficient | Average number of neighbors |
|---|---|---|---|---|
| DIP | 4928 | 17,201 | 0.095 | 6.981 |
| Krogan-core | 2708 | 7123 | 0.188 | 5.261 |
| Krogan14k | 3581 | 14,076 | 0.122 | 7.861 |
| Biogrid | 5640 | 59,748 | 0.246 | 21.187 |
| Collins | 1622 | 9074 | 0.555 | 11.189 |
Five yeast PPI networks were used in the performance comparison: DIP (Xenarios et al., 2002), Krogan-core (Krogan et al., 2006), Krogan14k (Krogan et al., 2006), Biogrid (Stark et al., 2006), Collins (Collins et al., 2007)
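The last column of the table is consistent with the average degree of an undirected graph, 2E/N, where N is the number of proteins and E the number of interactions. A short Python check, assuming that standard definition (the excerpt does not state the formula):

```python
# (number of proteins, number of interactions) as listed in the table
networks = {
    "DIP": (4928, 17201),
    "Krogan-core": (2708, 7123),
    "Krogan14k": (3581, 14076),
    "Biogrid": (5640, 59748),
    "Collins": (1622, 9074),
}


def average_degree(n_nodes, n_edges):
    """Average degree of an undirected graph: each edge adds two endpoints."""
    return 2 * n_edges / n_nodes


avg = {name: round(average_degree(n, m), 3) for name, (n, m) in networks.items()}
```

Rounded to three decimals, the computed values match the reported average number of neighbors for all five networks.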
Performance comparison based on six evaluation metrics on the five yeast data
| Datasets | Methods | #predicted complexes | #matched complexes | Precision | Recall | F-score | Acc |
|---|---|---|---|---|---|---|---|
| DIP | COACH | 570 | 263 | 0.450 | 0.620 | 0.521 | 0.243 |
| | CMC | 179 | 108 | 0.603 | 0.394 | 0.477 | 0.219 |
| | MCODE | 59 | 32 | 0.542 | 0.118 | 0.194 | 0.149 |
| | ClusterOne | 341 | 133 | 0.390 | 0.343 | 0.365 | 0.227 |
| | MCL | 451 | 69 | 0.153 | 0.172 | 0.162 | 0.190 |
| | PEWCC | 666 | 413 | 0.620 | 0.469 | 0.534 | 0.230 |
| | GANE | 324 | 202 | 0.623 | 0.550 | | |
| Krogan-core | COACH | 348 | 206 | 0.592 | 0.460 | 0.518 | 0.217 |
| | CMC | 128 | 86 | 0.672 | 0.304 | 0.419 | 0.206 |
| | MCODE | 71 | 52 | 0.732 | 0.198 | 0.311 | 0.176 |
| | ClusterOne | 522 | 190 | 0.364 | 0.464 | 0.408 | |
| | MCL | 376 | 126 | 0.335 | 0.414 | 0.371 | 0.262 |
| | PEWCC | 630 | 425 | 0.675 | 0.406 | 0.507 | 0.214 |
| | GANE | 208 | 161 | 0.774 | 0.436 | | 0.229 |
| Krogan14k | COACH | 570 | 263 | 0.461 | 0.465 | 0.463 | 0.217 |
| | CMC | 396 | 187 | 0.472 | 0.440 | 0.455 | 0.210 |
| | MCODE | 49 | 30 | 0.612 | 0.112 | 0.189 | 0.152 |
| | ClusterOne | 225 | 105 | 0.467 | 0.302 | 0.366 | 0.222 |
| | MCL | 445 | 133 | 0.299 | 0.323 | 0.311 | 0.233 |
| | PEWCC | 934 | 500 | 0.535 | 0.418 | 0.470 | 0.217 |
| | GANE | 247 | 169 | 0.684 | 0.442 | | |
| Biogrid | COACH | 1507 | 469 | 0.311 | 0.657 | 0.422 | 0.276 |
| | CMC | 1503 | 236 | 0.157 | 0.553 | 0.245 | 0.265 |
| | MCODE | 58 | 16 | 0.276 | 0.043 | 0.075 | 0.181 |
| | ClusterOne | 476 | 187 | 0.393 | 0.497 | 0.439 | |
| | MCL | 338 | 77 | 0.228 | 0.219 | 0.223 | 0.249 |
| | PEWCC | 2781 | 1044 | 0.375 | 0.677 | 0.483 | 0.288 |
| | GANE | 637 | 347 | 0.545 | 0.664 | | 0.310 |
| Collins | COACH | 251 | 188 | 0.749 | 0.522 | 0.615 | 0.280 |
| | CMC | 153 | 104 | 0.680 | 0.390 | 0.496 | 0.255 |
| | MCODE | 111 | 94 | 0.847 | 0.400 | 0.540 | 0.254 |
| | ClusterOne | 195 | 143 | 0.733 | 0.511 | 0.602 | 0.290 |
| | MCL | 183 | 134 | 0.732 | 0.506 | 0.598 | 0.286 |
| | PEWCC | 570 | 477 | 0.837 | 0.426 | 0.564 | 0.252 |
| | GANE | 199 | 163 | 0.819 | 0.491 | | |
Both F-score and Acc are overall evaluation metrics, so the highest values of F-score and Acc are set in bold for each dataset
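In the rows above, the first fractional column is consistent with the ratio of matched to predicted complexes (precision), and the third with the harmonic mean of the first two (F-score). A minimal Python check on the CMC row for DIP, assuming these standard definitions, which the excerpt itself does not spell out:

```python
def precision(n_matched, n_predicted):
    """Fraction of predicted complexes that match a reference complex."""
    return n_matched / n_predicted


def f_score(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r)


# CMC on DIP: 179 predicted complexes, 108 matched, reported recall 0.394.
p_cmc = round(precision(108, 179), 3)
f_cmc = round(f_score(0.603, 0.394), 3)
```

Both rounded values agree with the table entries for that row (0.603 and 0.477). Recall and Acc cannot be recomputed this way because they depend on the reference complex set, which is not included here.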
Fig. 2 Comparison with six protein complex detection algorithms in terms of the composite score of F-score and Acc. Shades of the same color indicate different evaluating scores. Each bar height reflects the value of the composite score.
Examples of predicted complexes on the DIP dataset
| ID | Protein complex | Matched or not | Min p-value | GO-Description |
|---|---|---|---|---|
| 1 | YLR376C YHL006C YIL132C YDR078C | No | 1.95e-10 | DNA recombinase assembly |
| 2 | YFR015C YJL137C YLR258W | No | 9.79e-07 | Glycogen biosynthetic process |
| 3 | YLR078C YLR026C YDR189W YDR498C YLR268W YOR075W | No | 1.42e-12 | Vesicle fusion |
| 4 | YDR331W YMR298W YKL008C YHL003C YGR060W | No | 3.96e-07 | Ceramide biosynthetic process |
| 5 | YLR409C YER082C YKR060W YJR002W YPR144C YER127W YNL132W YDR299W YNL308C YCL059C YJL069C YCR057C YDR324C YGR145W | No | 6.24e-23 | Ribosomal small subunit biogenesis |
| 6 | YOR016C YHR140W YHL042W YBR106W YCR101C YDR414C YEL017C-A YAR028W YGL259W YKL065C YGL042C YER039C YJL004C YPL264C | No | 0.00014 | Protein localization to endoplasmic reticulum |
Fig. 3 The sensitivity of GANE with respect to three parameters. a The performance of GANE when embedding vector dimension d was varied from 32 to 224. b The performance of GANE when harmonic value λ was varied from 0.00001 to 1000. c The performance of GANE when threshold value θ was varied from 0.1 to 0.9.