Sara Omranian, Zoran Nikoloski, Dominik G Grimm.
Abstract
Physically interacting proteins form macromolecular complexes that drive diverse cellular processes. Advances in experimental techniques that capture interactions between proteins have provided protein-protein interaction (PPI) networks for several model organisms. These datasets have enabled the prediction and other computational analyses of protein complexes. Here we provide a systematic review of the state-of-the-art algorithms for protein complex prediction from PPI networks proposed in the past two decades. We categorize the existing approaches into three groups: cluster quality-based, node affinity-based, and network embedding-based approaches, and we compare and contrast their advantages and disadvantages. We further include a comparative analysis by computing the performance of eighteen methods based on twelve well-established performance measures on four widely used benchmark protein-protein interaction networks. Finally, we discuss the limitations and drawbacks of both current data and current approaches, along with potential solutions, with emphasis on the points that pave the way for future research efforts in this field.
Keywords: Network Clustering Algorithms; Network embedding; Protein Complex Prediction; Protein-Protein interaction network
Year: 2022 PMID: 35685359 PMCID: PMC9166428 DOI: 10.1016/j.csbj.2022.05.049
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 6.155
Fig. 1. Categories of the network clustering algorithms used in protein complex prediction with PPI networks. The network clustering algorithms require as input either only a PPI network (methods in black) or both a PPI network and biological information (methods in red). Regardless of the input, the existing network clustering algorithms with applications to complex prediction can be divided into three categories, namely: node affinity-based, cluster quality-based, and network embedding-based methods. For each category, several examples are given and explained in this review. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Summary of protein–protein interaction networks.
| Name | Version / update date | Species | #Proteins | #Interactions |
|---|---|---|---|---|
| DIP | 5/Feb/2017 | All | 28,255 | 76,881 |
| BioGRID | 4.4.206 | All | 80,939 | 1,191,174 |
| STRING | 11.5 | All | 67.6 million | >20 billion |
| Babu | 27/Nov/2017 | E. coli | 2,045 | 12,801 |
| Cong | 12/Jul/2019 | E. coli | 1,476 | 1,618 |
| Collins | Mar/2007 | S. cerevisiae | 1,622 | 9,074 |
| Gavin | Jan/2006 | S. cerevisiae | 1,855 | 7,669 |
| Krogan | Mar/2006 | S. cerevisiae | 6,380 | 21,440 |
| PIPs | v1.1 | H. sapiens | 5,751 | 79,441 |
Summary of protein complex gold standards.
| Name | Species | #Proteins | #Complexes | #Complexes (size > 2) |
|---|---|---|---|---|
| CYC2008 | S. cerevisiae | 1,627 | 408 | 236 |
| SGD | S. cerevisiae | 1,279 | 323 | 238 |
| CORUM | H. sapiens | 4,479 | 4,274 | 2,783 |
| EcoCyc | E. coli | 749 | 299 | 181 |
| Met | E. coli | 475 | 206 | 118 |
Fig. 2. Categories of computational approaches to detect protein complexes. Node affinity-based approaches use different node scoring methods, while cluster quality-based approaches cast protein complex prediction as an optimization problem on PPI networks. However, the subsequent steps to find protein complexes are almost the same for both categories. The network embedding-based approaches predict protein complexes by first transforming each node into a vector and then computing similarities between pairs of node vectors. Lastly, they apply a network clustering algorithm to find protein complexes.
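The three steps of the embedding-based pipeline described in the Fig. 2 caption (node to vector, pairwise similarity, clustering) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from any reviewed method: the "embedding" is simply each protein's adjacency row with a self-loop, the similarity is cosine similarity, the clustering step is connected components over pairs above a hypothetical `threshold`, and the protein names are made up.

```python
from math import sqrt

def embed_nodes(edges):
    """Step 1: map each protein to a vector (adjacency row plus self-loop).
    A trivial stand-in for a learned network embedding."""
    nodes = sorted({p for edge in edges for p in edge})
    index = {p: i for i, p in enumerate(nodes)}
    vectors = {p: [0.0] * len(nodes) for p in nodes}
    for p in nodes:
        vectors[p][index[p]] = 1.0
    for a, b in edges:
        vectors[a][index[b]] = 1.0
        vectors[b][index[a]] = 1.0
    return vectors

def cosine(u, v):
    """Step 2: similarity between two node vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def cluster_by_similarity(vectors, threshold=0.7):
    """Step 3: link proteins whose similarity is >= threshold (assumed value)
    and return the connected components as non-overlapping clusters."""
    nodes = sorted(vectors)
    neighbors = {p: set() for p in nodes}
    for i, p in enumerate(nodes):
        for q in nodes[i + 1:]:
            if cosine(vectors[p], vectors[q]) >= threshold:
                neighbors[p].add(q)
                neighbors[q].add(p)
    clusters, seen = [], set()
    for p in nodes:
        if p in seen:
            continue
        stack, component = [p], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(neighbors[current] - component)
        seen |= component
        clusters.append(component)
    return clusters

# Toy network: two triangles joined by one edge (hypothetical protein names).
toy_edges = [("A", "B"), ("A", "C"), ("B", "C"),
             ("D", "E"), ("D", "F"), ("E", "F"),
             ("C", "D")]
toy_clusters = cluster_by_similarity(embed_nodes(toy_edges))
```

On this toy network the bridging edge between `C` and `D` yields a cosine similarity of 0.5, below the assumed threshold, so the two triangles come out as separate clusters. Real embedding-based methods replace each of the three steps with something considerably more elaborate.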
Overview of computational approaches for prediction of protein complexes from PPI networks. The current state-of-the-art methods are divided into three categories: node-affinity, cluster-quality, and network embedding-based approaches. The input of each method is shown in the second column. A link to the public implementation of each method (if available) along with the year of publication is given in the third column. Other properties such as the number of parameters, the capability of the method to use edge-weights or to predict overlapping protein complexes are given in the last three columns, respectively.
| Category | Biological Knowledge/data | Method – Website | Feature(s) |
|---|---|---|---|
| Node affinity-based approaches | × | MCL [2002] | MCL has 2 parameters and utilizes edge weights. It detects non-overlapping clusters. The size of the clusters depends on the inflation parameter. |
| Node affinity-based approaches | × | MCODE [2003] | MCODE depends on 5 parameters and does not utilize edge weights. By setting the fluff parameter, it can detect overlapping clusters. The predicted clusters are of high density. MCODE is unable to find sparse clusters. |
| Node affinity-based approaches | × | CFinder [2006] | CFinder has 2 parameters and employs edge weights. The predicted clusters have a clique topology. CFinder detects overlapping clusters, but it is unable to find sparse ones. |
| Node affinity-based approaches | × | AP [2007] | AP has 1 parameter that affects the cluster formation, and it does not use edge weights. It detects non-overlapping and dense clusters. |
| Node affinity-based approaches | × | CMC [2009] | CMC has 2 parameters and employs edge weights. The clusters have a clique topology. CMC is unable to find sparse clusters. The size of the clusters depends on the parameters. CMC can detect overlapping clusters. |
| Node affinity-based approaches | × | PEWCC [2013] | PEWCC has 2 parameters and uses edge weights. It deals with false-positive interactions by introducing a PE-score, but it does not consider the effect of false-negative ones. PEWCC detects highly overlapped and repetitive clusters. |
| Node affinity-based approaches | × | ProRank+ [2014] | ProRank+ has 2 parameters and employs edge weights. It considers the effect of false-positive interactions but not of false-negative ones. ProRank+ detects overlapping clusters. |
| Node affinity-based approaches | × | DPC-NADPIN [2016] | DPC-NADPIN has 2 parameters and does not utilize edge weights. It incorporates gene expression data to create a dynamic PPI network. It is unable to predict small clusters. DPC-NADPIN detects overlapping clusters. |
| Node affinity-based approaches | × | idenPC-MIIP [2020] | idenPC-MIIP has 2 parameters and employs edge weights. It considers the effect of false-positive interactions by calculating a MIIP-score. idenPC-MIIP can detect overlapping clusters. |
| Node affinity-based approaches | Microarray data | DMSP [2007] | DMSP depends on 2 parameters. It considers the effect of false-positive edges by calculating the gene-expression similarity between pairs of proteins. DMSP predicts non-overlapping clusters. |
| Cluster quality-based approaches | × | miPALM [2010] | miPALM has 2 parameters and assigns edge weights. It detects dense clusters and is unable to predict small and sparse clusters. miPALM predicts overlapping clusters; however, it does not consider the effect of false-positive and false-negative interactions. |
| Cluster quality-based approaches | × | ClusterOne [2012] | ClusterOne has 3 parameters and utilizes edge weights. It is unable to find small and sparse clusters. ClusterOne predicts overlapping clusters; however, it does not consider the effect of false-negative interactions. |
| Cluster quality-based approaches | × | Core&Peel [2016] | Core&Peel depends on 3 parameters and uses edge weights. It predicts dense complexes. The size and density of the clusters depend on 2 parameters. Core&Peel can detect overlapping clusters; however, it does not consider the effect of false-negative interactions. |
| Cluster quality-based approaches | × | IMHRC [2017] | IMHRC has 5 parameters and employs edge weights. It is unable to find small and sparse clusters. IMHRC can detect overlapping clusters; however, it does not consider the effect of false-negative interactions. |
| Cluster quality-based approaches | × | PC2P [2020] | PC2P is a parameter-free algorithm. It can detect small and large as well as sparse and dense clusters. However, it does not utilize edge weights and detects only non-overlapping clusters. |
| Cluster quality-based approaches | × | CC [2021] | CC is a parameter-free approach. It can detect small and large as well as sparse and dense clusters. However, it does not utilize edge weights and detects only non-overlapping clusters. |
| Cluster quality-based approaches | × | OCC [2021] | OCC is a parameter-free approach. It can detect small and large as well as sparse and dense clusters. Although it does not utilize edge weights, it can detect overlapping clusters. |
| Cluster quality-based approaches | × | WCC [2021] | WCC is a parameter-free approach. It can detect small and large as well as sparse and dense clusters. Although it utilizes edge weights, it detects only non-overlapping clusters. |
| Cluster quality-based approaches | × | OWCC [2021] | OWCC is a parameter-free approach that uses edge weights. It can detect small and large as well as sparse and dense clusters. OWCC detects overlapping clusters; however, it does not consider the effect of false-negative interactions. |
| Cluster quality-based approaches | × | CUBCO [2022] | CUBCO is a parameter-free approach that uses edge weights. It can detect small and large as well as sparse and dense clusters. CUBCO considers the effect of false-negative as well as false-positive interactions; however, it cannot detect overlapping clusters. |
| Cluster quality-based approaches | Functional homogeneity | RNSC [2004] | RNSC depends on 7 parameters and does not consider edge weights. RNSC is a randomized algorithm that generates different clusters in each round. It is highly dependent on the initial clusters and is unable to detect overlapping clusters. |
| Network embedding-based approaches | × | CPNM [2020] | CPNM has 6 parameters and uses edge weights. It finds non-overlapping clusters. CPNM detects dense clusters but not sparse ones. |
| Network embedding-based approaches | × | DPCMNE [2021] | DPCMNE depends on 5 parameters and uses edge weights. It is not able to detect sparse clusters, but it can detect overlapping clusters. |
| Network embedding-based approaches | Gene Ontology | GANE [2018] | GANE has 3 parameters and utilizes edge weights. Although it cannot detect sparse clusters, it is able to predict overlapping clusters. |
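
Many entries in the table above distinguish dense from sparse clusters. One common way to make that distinction concrete, assumed here for illustration rather than taken from the review, is the edge density of a cluster: the number of interactions among its proteins divided by the number of possible protein pairs. The function name `cluster_density` and the protein names are hypothetical.

```python
def cluster_density(cluster, edges):
    """Fraction of possible protein pairs within `cluster` that interact.

    `edges` is an iterable of (protein, protein) pairs from an unweighted,
    undirected PPI network. Self-loops and duplicate edges are ignored.
    """
    members = set(cluster)
    n = len(members)
    if n < 2:
        return 0.0
    internal = {frozenset((a, b)) for a, b in edges
                if a in members and b in members and a != b}
    return 2 * len(internal) / (n * (n - 1))

# A triangle is a clique (density 1.0); a 4-protein chain is sparser (0.5).
triangle = [("A", "B"), ("B", "C"), ("A", "C")]
chain = [("P", "Q"), ("Q", "R"), ("R", "S")]
triangle_density = cluster_density({"A", "B", "C"}, triangle)
chain_density = cluster_density({"P", "Q", "R", "S"}, chain)
```

Under this measure, methods described as finding only clique-like or high-density clusters would tend to miss complexes such as the chain above, which is the limitation the table records as being unable to find sparse clusters.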
Fig. 3. GO semantic similarity analysis of protein complexes of gold standards. The distribution of the median GO semantic similarity of reference complexes is compared with that of randomly generated complexes, for altogether five gold standards and their randomized variants across three species: (A) E. coli, (B) S. cerevisiae, and (C) H. sapiens.
Fig. 4. Comparative analysis of approaches for prediction of protein complexes. Eighteen state-of-the-art approaches are applied to four PPI networks of S. cerevisiae: (A) Collins, (B) Gavin, (C) KroganCore, and (D) KroganExt. The predicted clusters from the different approaches are compared with the protein complexes in the gold standard CYC2008. The comparative analysis is conducted with respect to a composite score, which is the sum of four performance measures: maximum matching ratio (MMR), fraction match (FRM), accuracy (ACC), and F-measure. The eighteen approaches are ordered first by category: node affinity-based (in brown), cluster quality-based (in green), and network embedding-based (in pink). Within each category, the methods are ordered by year of publication. The results indicate that the cluster quality-based methods, more specifically those that model a protein complex as a biclique spanned subgraph, outperformed the others. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
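
The composite score named in the Fig. 4 caption is a plain sum of four measures. The sketch below shows that sum together with two of its ingredients. The definitions used are assumptions: the overlap score |A ∩ B|² / (|A| · |B|) and the matching threshold of 0.25 are conventions often used in this literature, but this text does not state which definitions the review applies. MMR and ACC are not implemented and are passed in as precomputed numbers; all function names and toy complexes are hypothetical.

```python
def overlap_score(a, b):
    """Overlap between two protein sets: |A ∩ B|^2 / (|A| * |B|)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    shared = len(a & b)
    return shared * shared / (len(a) * len(b))

def fraction_matched(targets, candidates, threshold=0.25):
    """Share of `targets` matched by at least one candidate
    with overlap score >= threshold (assumed cutoff)."""
    if not targets:
        return 0.0
    hits = sum(1 for t in targets
               if any(overlap_score(t, c) >= threshold for c in candidates))
    return hits / len(targets)

def f_measure(predicted, reference, threshold=0.25):
    """Harmonic mean of precision (matched predictions)
    and recall (matched reference complexes)."""
    precision = fraction_matched(predicted, reference, threshold)
    recall = fraction_matched(reference, predicted, threshold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def composite_score(mmr, frm, acc, f):
    """Sum of the four measures, as described in the Fig. 4 caption."""
    return mmr + frm + acc + f

# Toy data: one good prediction and one spurious one (hypothetical proteins).
reference = [{"A", "B", "C"}, {"D", "E"}]
predicted = [{"A", "B", "C", "X"}, {"Y", "Z"}]
frm = fraction_matched(reference, predicted)
f = f_measure(predicted, reference)
```

Because each of the four measures lies between 0 and 1, the composite score lies between 0 and 4, and a method can rank well overall while being weak on a single measure.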