Rongquan Wang1,2, Guixia Liu3,4, Caixia Wang5, Lingtao Su1,2, Liyan Sun1,2.
Abstract
BACKGROUND: In recent decades, detecting protein complexes (PCs) from protein-protein interaction networks (PPINs) has been an active area of research. Many graph clustering methods identify PCs well. However, most existing methods overlook the inherent core-attachment organization of PCs and therefore share three major limitations. First, many methods neglect seed selection, in particular the effect of using overlapping nodes as seed nodes, which can lead to false predictions. Second, PCs are generally assumed to be dense subgraphs, whereas subgraphs with a high local modularity structure usually correspond to PCs. Third, a number of available methods lack a noise-handling mechanism and miss some peripheral proteins. Addressing these issues is important for predicting more biologically meaningful overlapping PCs.
Keywords: Core-attachment and local modularity structure; Node betweenness; Overlapping node; Protein complex; Protein-protein interaction networks; Seed-extension paradigm
Year: 2018 PMID: 30134824 PMCID: PMC6106838 DOI: 10.1186/s12859-018-2309-9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Definitions and terminology used to describe the architecture of overlapping PCs. An example of overlapping PCs, whose core components consist of the core nodes in the dashed circle. A PC consists of core components and attachments. Attachments in turn consist of modules and peripheral nodes: a “module” is composed of overlapping nodes, and the remaining attachment nodes are called peripheral nodes. The three types of nodes are marked by different colors. The two overlapping PCs are circled by solid lines
Fig. 2 The formation process of a protein complex. The four types of nodes are marked by different colors. a the deep red protein represents the seed protein; b the red proteins inside the red dotted circle constitute a complex core; c the green proteins inside the green dotted circle represent peripheral proteins; d the yellow proteins inside the yellow dotted circle represent overlapping proteins; e the chocolate yellow proteins represent interspersed nodes; f the complex core, peripheral proteins, and overlapping proteins inside the blue circle constitute a protein complex. An example illustrates the clustering process. This simple network has 22 nodes, and each edge has weight 0.2 except (0,1),(0,2),..., and (3,4). Node 0 is taken as the seed protein and the initial cluster {0} is constructed. In the greedy search process, the neighbors of node 0 are {1,2,3,4,5,8,9}. Node 1 has the highest value of the support function (Eq. (7)). We add node 1 to the cluster, and if the local modularity score increases, the cluster becomes {0,1}. Similarly, nodes 2, 3, and 4 are added to the cluster in sequence, leaving 5, 8, and 9 as the remaining neighbors of node 0. Node 5 has the highest support value, but adding it to the cluster {0,1,2,3,4} decreases the local modularity score. Node 5 is therefore removed from the cluster and this greedy search terminates. The cluster {0,1,2,3,4} now constitutes the complex core. A second greedy search then extends the complex core to form the whole complex. The neighboring nodes of the complex core {0,1,2,3,4} are 5, 6, 7, 8, and 9; we iterate this process until the cluster no longer changes and save it as the first cluster. Similarly, the next search starts from the next seed node to expand the next cluster
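The core-growing step walked through in the Fig. 2 example can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the support function of Eq. (7) and the local modularity score are not given in this excerpt, so the sketch substitutes a hypothetical support (total edge weight from a candidate into the current cluster) and a hypothetical local modularity (internal weight divided by internal plus boundary weight). The ten-node toy network is also invented; it only mimics the figure's pattern of heavy edges among nodes 0 to 4 and 0.2-weight edges elsewhere.

```python
from itertools import combinations


def build_graph(edges):
    """Build an undirected weighted adjacency map from (u, v, w) triples."""
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph


def local_modularity(graph, cluster):
    """ASSUMED score: internal weight / (internal + boundary weight)."""
    w_in = 0.0
    w_out = 0.0
    for u in cluster:
        for v, w in graph[u].items():
            if v in cluster:
                w_in += w / 2.0  # each internal edge is visited from both ends
            else:
                w_out += w
    total = w_in + w_out
    return w_in / total if total > 0 else 0.0


def support(graph, node, cluster):
    """ASSUMED stand-in for Eq. (7): edge weight from node into the cluster."""
    return sum(w for v, w in graph[node].items() if v in cluster)


def grow_core(graph, seed):
    """Greedily add the seed's best-supported neighbor while the score rises."""
    cluster = {seed}
    score = local_modularity(graph, cluster)
    while True:
        candidates = [v for v in graph[seed] if v not in cluster]
        if not candidates:
            break
        # Highest support wins; ties go to the smallest node id.
        best = max(candidates, key=lambda v: (support(graph, v, cluster), -v))
        new_score = local_modularity(graph, cluster | {best})
        if new_score <= score:
            break  # the candidate would not raise the score, so stop
        cluster.add(best)
        score = new_score
    return cluster


# Hypothetical toy network: heavy edges among 0..4, light edges elsewhere.
edges = [(u, v, 1.0) for u, v in combinations(range(5), 2)]
edges += [(0, 5, 0.2), (0, 8, 0.2), (0, 9, 0.2),
          (5, 6, 0.2), (5, 7, 0.2), (6, 7, 0.2), (8, 9, 0.2)]
graph = build_graph(edges)

core = grow_core(graph, 0)
print(sorted(core))  # [0, 1, 2, 3, 4]
```

On this toy network the search adds nodes 1 through 4 in turn and rejects node 5 because it lowers the assumed score, mirroring the order of events in the caption. The second stage (extending the core with attachments) would repeat the same loop over the neighbors of the whole core rather than of the seed alone.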
The properties of the three datasets used in the experimental study
| Dataset | Proteins | Interactions | Network density | Average no. of neighbors |
|---|---|---|---|---|
| Collins | 1622 | 9074 | 0.007 | 11.189 |
| Gavin | 1855 | 7119 | 0.004 | 8.268 |
| Krogan core | 2708 | 7123 | 0.002 | 5.261 |
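The two derived columns can be related to the node and edge counts with the standard formulas for a simple undirected graph, density = 2E / (N(N − 1)) and average number of neighbors = 2E / N. The source does not state how its values were computed, so the formulas and function names below are assumptions. Applied to the Collins and Krogan core rows, they reproduce the tabulated values to three decimals; the Gavin average comes out near 7.68 rather than 8.268, so that row may follow a different counting convention that this excerpt does not describe.

```python
def network_density(n_nodes, n_edges):
    """Density of a simple undirected graph: 2E / (N * (N - 1))."""
    return 2 * n_edges / (n_nodes * (n_nodes - 1))


def average_degree(n_nodes, n_edges):
    """Average number of neighbors per node: 2E / N."""
    return 2 * n_edges / n_nodes


# (proteins, interactions) taken from the table above.
datasets = {
    "Collins": (1622, 9074),
    "Krogan core": (2708, 7123),
}

for name, (n, e) in datasets.items():
    print(f"{name}: {network_density(n, e):.3f} {average_degree(n, e):.3f}")
# Collins: 0.007 11.189
# Krogan core: 0.002 5.261
```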
The statistics of benchmark datasets
| Complex dataset | Overlapping complexes | Non-overlapping complexes | Total complexes |
|---|---|---|---|
| NewMIPS | 283 (86.28%) | 45 (13.72%) | 328 (100%) |
| CYC2008 | 108 (45.77%) | 128 (54.23%) | 236 (100%) |
Fig. 3 Composite score using CYC2008 as the benchmark with respect to various overlapping score thresholds. Comparison of the composite score of CALM and three other state-of-the-art methods on different weighted networks with respect to different overlapping score thresholds (from 0.1 to 1 in increments of 0.1). The PPI datasets are a Collins et al., b Gavin et al., c Krogan core et al. The composite score comprises ACC, Fraction, and MMR
Fig. 4 Composite score using NewMIPS as the benchmark with respect to various overlapping score thresholds. Comparison of the composite score of CALM and three other state-of-the-art methods on weighted networks with respect to different overlapping score thresholds (from 0.1 to 1 in increments of 0.1). The PPI datasets are a Collins et al., b Gavin et al., c Krogan core et al. The composite score comprises ACC, Fraction, and MMR
Fig. 5 Prediction performance on three PPINs using CYC2008 as the benchmark. The comparisons are in terms of the geometric accuracy (ACC), the fraction of reference complexes that are matched by at least one predicted cluster (Fraction), and the maximum matching ratio (MMR). The PPI datasets are a Collins et al., b Gavin et al., c Krogan core et al. The total height of each bar is the composite score of the three metrics on a given network. Larger scores are better
Fig. 6 Prediction performance on three PPINs using NewMIPS as the benchmark. The comparisons are in terms of the geometric accuracy (ACC), the fraction of reference complexes that are matched by at least one predicted cluster (Fraction), and the maximum matching ratio (MMR). The PPI datasets are a Collins et al., b Gavin et al., c Krogan core et al. The total height of each bar is the composite score of the three metrics on a given network. Larger scores are better
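The Fraction metric named in the captions, and its dependence on the overlapping score threshold, can be illustrated with a short sketch. The overlap score formula is not given in this excerpt; the code assumes the commonly used form |A ∩ B|² / (|A| · |B|), and the function names and the toy complexes are hypothetical.

```python
def overlap_score(a, b):
    """ASSUMED overlap score: |A & B|^2 / (|A| * |B|)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    shared = len(a & b)
    return shared * shared / (len(a) * len(b))


def fraction_matched(reference, predicted, threshold):
    """Share of reference complexes matched by at least one predicted cluster."""
    if not reference:
        return 0.0
    matched = sum(
        1
        for ref in reference
        if any(overlap_score(ref, pred) >= threshold for pred in predicted)
    )
    return matched / len(reference)


# Hypothetical toy data: three reference complexes, two predicted clusters.
reference = [{"A", "B", "C"}, {"D", "E"}, {"F", "G", "H", "I"}]
predicted = [{"A", "B", "C", "X"}, {"D", "Y"}]

# Sweep the threshold from 0.1 to 1.0 in increments of 0.1, as in Figs. 3 and 4.
thresholds = [round(0.1 * k, 1) for k in range(1, 11)]
sweep = {t: fraction_matched(reference, predicted, t) for t in thresholds}
```

In this toy data the first reference complex scores 0.75 against its best prediction and the second scores 0.25, so Fraction is 2/3 at a threshold of 0.2 and drops to 1/3 at 0.5. A stricter threshold can only keep or lower the value, which is why the figures report the score across the whole range.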