Rongquan Wang, Guixia Liu, Caixia Wang.
Abstract
BACKGROUND: Protein complex identification from protein-protein interaction (PPI) networks is crucial for understanding cellular organization principles and functional mechanisms. In recent decades, numerous computational methods have been proposed to identify protein complexes. However, most current state-of-the-art methods still face unresolved challenges, including high false-positive rates, an inability to identify overlapping complexes, a lack of consideration for the inherent organization within protein complexes, and the absence of some biological attachment proteins.
Keywords: Core-attachment structure; Protein complexes; Protein-protein interaction networks; Spurious interactions; Structural similarity
Year: 2019 PMID: 31521132 PMCID: PMC6744658 DOI: 10.1186/s12859-019-3007-y
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 A network with two protein complexes and three overlapping proteins. Each protein complex consists of core proteins and peripheral proteins, and the three overlapping proteins in the yellow overlapping area are shared by the two complexes. The core proteins inside each red dotted circle constitute the core of that protein complex. Diamond nodes represent core proteins, circle nodes represent peripheral proteins, hexagonal nodes represent overlapping proteins, and parallelogram nodes represent interspersed proteins
Fig. 2 A simple hypothetical network of 11 proteins and 15 interactions, used to illustrate how the weight of the edge e1 is determined
Details of the PPI networks used in the experiments
| Dataset | Number of nodes | Number of edges | Density |
|---|---|---|---|
| DIP | 4930 | 17201 | 0.00141572191 |
| BioGRID | 5640 | 59748 | 0.00375726796 |
| Yeast | 6194 | 74826 | 0.00390130805 |
| Human | 15459 | 144687 | 0.00121094608 |
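The Density column is consistent with the standard density of a simple undirected graph, 2E / (N(N - 1)), although the table itself does not state the formula. The following is a minimal sketch under that assumption; the function name `graph_density` is hypothetical and only the Yeast and Human rows are used as inputs.

```python
def graph_density(num_nodes: int, num_edges: int) -> float:
    """Density of a simple undirected graph: 2E / (N * (N - 1)).

    Assumption: the table uses this standard definition (no self-loops,
    no multi-edges); the source does not spell the formula out.
    """
    if num_nodes < 2:
        raise ValueError("density is undefined for fewer than two nodes")
    return 2 * num_edges / (num_nodes * (num_nodes - 1))


# (nodes, edges) copied from the Yeast and Human rows of the table.
networks = {
    "Yeast": (6194, 74826),
    "Human": (15459, 144687),
}

for name, (nodes, edges) in networks.items():
    print(f"{name}: {graph_density(nodes, edges):.11f}")
```

Under this definition the printed values can be compared directly against the Density column for those two rows.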
General properties of the standard protein complexes
| Datasets | Number of protein complexes | Protein coverage | Avg size |
|---|---|---|---|
| CYC2008 | 236 | 1628 | 4.71 |
| NewMIPS | 328 | 1171 | 14.93 |
| Human complexes | 2289 | 6206 | 8.57 |
| Yeast complexes | 1045 | 2773 | 8.92 |
Performance comparison with other methods based on NewMIPS
| Algorithms | Recall | Precision | F-measure | MMR | CR |
|---|---|---|---|---|---|
| BioGRID | | | | | |
| MCL | 0.2896 | 0.2011 | 0.2374 | 0.0726 | 0.2995 |
| CFinder | 0.5914 | 0.1960 | 0.2944 | 0.2801 (3) | 0.4402 |
| Core | 0.5609 | 0.1488 | 0.2352 | 0.1437 | 0.5882 |
| DPClus | 0.6951 | 0.1741 | 0.2785 | 0.201 | 0.5597 |
| CMC | 0.8109 (1) | 0.2731 | 0.4086 | 0.3175 (2) | 0.4954 |
| COACH | 0.7256 | 0.2581 | 0.3807 | 0.2525 | 0.6322 (3) |
| SPICi | 0.4969 | 0.3725 | 0.4258 | 0.1304 | 0.4378 |
| ClusterONE | 0.5914 | 0.3130 | 0.4093 | 0.1917 | 0.5311 |
| PEWCC | 0.4512 | 0.5943 (2) | 0.5129 (3) | 0.1889 | 0.4119 |
| ProRank+ | 0.4817 | 0.7131 (1) | 0.5750 (2) | 0.241 | 0.4763 |
| GMFTP | 0.7530 (3) | 0.2830 | 0.4114 | 0.2551 | 0.5186 |
| DPC | 0.6310 | 0.3050 | 0.4112 | 0.2312 | 0.6332 (2) |
| EWCA | 0.7561 (2) | 0.5821 (3) | | | |
| DIP | | | | | |
| MCL | 0.4908 | 0.1783 | 0.2616 | 0.1255 | 0.3271 |
| CFinder | 0.5762 | 0.2408 | 0.3396 | 0.2128 | 0.2403 |
| Core | 0.4420 | 0.1746 | 0.2504 | 0.1249 | 0.3902 |
| DPClus | 0.6067 (3) | 0.1392 | 0.2265 | 0.1626 | 0.3356 |
| CMC | 0.5932 | 0.4152 | 0.4885 | 0.2499 (2) | |
| COACH | 0.5731 | 0.5106 (2) | 0.5401 (2) | 0.2006 | 0.3351 |
| SPICi | 0.4847 | 0.2473 | 0.3275 | 0.1095 | 0.3191 |
| ClusterONE | 0.4054 | 0.3020 | 0.3462 | 0.1178 | 0.2417 |
| PEWCC | 0.5670 | 0.4822 | 0.5212 (3) | 0.2297 (3) | 0.3280 |
| ProRank+ | 0.4085 | | 0.5063 | 0.1669 | 0.2444 |
| GMFTP | 0.6981 (2) | 0.2755 | 0.3951 | 0.2228 | 0.4043 (2) |
| DPC | 0.4908 | 0.4389 | 0.4634 | 0.1717 | 0.3305 |
| EWCA | | 0.4990 (3) | | | 0.3982 (3) |
NOTE: The highest value in each column is shown in bold. Parenthesized numbers mark the first-, second- and third-ranked values in each column
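The F-measure column in these comparison tables agrees with the harmonic mean of the Recall and Precision columns, which is the usual definition but is not stated in this excerpt. The sketch below assumes that definition and recomputes the value for two complete BioGRID rows (MCL and SPICi); the helper name `f_measure` is hypothetical.

```python
def f_measure(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision.

    Assumption: this is the definition behind the F-measure column;
    the excerpt reports the values without giving the formula.
    """
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# (recall, precision) copied from complete BioGRID rows of the NewMIPS table.
rows = {
    "MCL": (0.2896, 0.2011),
    "SPICi": (0.4969, 0.3725),
}

for name, (recall, precision) in rows.items():
    print(f"{name}: F-measure = {f_measure(recall, precision):.4f}")
```

Recomputing the column this way is a quick consistency check when a row's Recall and Precision are both available.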
Performance comparison with other methods based on CYC2008
| Algorithms | Recall | Precision | F-measure | MMR | CR |
|---|---|---|---|---|---|
| BioGRID | | | | | |
| MCL | 0.3516 | 0.2268 | 0.2758 | 0.1245 | 0.5310 |
| CFinder | 0.5720 | 0.1637 | 0.2546 | 0.3115 | 0.6135 |
| Core | 0.5847 | 0.1527 | 0.2422 | 0.2081 | 0.8058 |
| DPClus | 0.7839 | 0.1978 | 0.3158 | 0.304 | 0.8160 |
| CMC | | 0.2677 | 0.4088 | | 0.7639 |
| COACH | 0.7669 | 0.2488 | 0.3757 | 0.3042 | |
| SPICi | 0.5127 | 0.4039 | 0.4518 | 0.1997 | 0.6065 |
| ClusterONE | 0.6610 | 0.3487 | 0.4565 | 0.2734 | 0.7569 |
| PEWCC | 0.4025 | 0.5374 (3) | 0.4603 (3) | 0.2142 | 0.5431 |
| ProRank+ | 0.4153 | | 0.5104 (2) | 0.246 | 0.5850 |
| GMFTP | 0.7838 (3) | 0.2914 | 0.4249 | 0.3913 (3) | 0.7956 |
| DPC | 0.7033 | 0.2874 | 0.4081 | 0.2643 | 0.8616 (3) |
| EWCA | 0.8093 (2) | 0.5793 (2) | | 0.4351 (2) | 0.8718 (2) |
| DIP | | | | | |
| MCL | 0.5169 | 0.1847 | 0.2721 | 0.1899 | 0.4892 |
| CFinder | 0.5508 | 0.2398 | 0.3342 | 0.2788 | 0.3807 |
| Core | 0.4618 | 0.1818 | 0.2609 | 0.2033 | 0.5317 |
| DPClus | 0.6651 (3) | 0.1518 | 0.2473 | 0.2610 | 0.5184 |
| CMC | 0.5932 | 0.4125 | 0.4866 | 0.2501 | 0.5755 (3) |
| COACH | 0.5423 | 0.5167 (3) | 0.5292 (2) | 0.2764 | 0.4879 |
| SPICi | 0.5000 | 0.2769 | 0.3564 | 0.1665 | 0.4600 |
| ClusterONE | 0.4279 | 0.3343 | 0.3753 | 0.1840 | 0.3750 |
| PEWCC | 0.5296 | 0.4852 | 0.5064 (3) | 0.2847 (3) | 0.4682 |
| ProRank+ | 0.3771 | | 0.4883 | 0.2029 | 0.3293 |
| GMFTP | 0.6652 (2) | 0.2664 | 0.3804 | 0.3315 (2) | |
| DPC | 0.4872 | 0.4598 | 0.4731 | 0.2146 | 0.4828 |
| EWCA | | 0.5239 (2) | | | 0.5806 (2) |
NOTE: The highest value in each column is shown in bold. Parenthesized numbers mark the first-, second- and third-ranked values in each column
Functional enrichment analysis of protein complexes detected from different datasets
| Dataset | Algorithms | PC | <E-15 | [E-15,E-10) | [E-10,E-5) | [E-5,0.01) | Significant |
|---|---|---|---|---|---|---|---|
| BioGRID | CMC | 1113 | 125(11.23%) | 89(7.99%) | 258(23.18%) | 360(32.34%) | 832(74.76%) |
| | PEWCC | 387 | 181(46.77%) | 64(16.53%) | 83(21.44%) | 46(11.88%) | 374(96.65%) |
| | GMFTP | 597 | 73(12.22%) | 59(9.88%) | 156(26.13%) | 161(26.96%) | 449(75.21%) |
| | COACH | 166 | 76(45.78%) | 32(19.27%) | 38(22.89%) | 16(9.63%) | 162(97.60%) |
| | ProRank+ | 746 | 479(64.20%) | 105(14.07%) | 97(13.00%) | 47(6.30%) | 728(97.59%) |
| | DPC | 2167 | 596(27.50%) | 166(7.66%) | 290(13.38%) | 569(26.25%) | 1621(74.81%) |
| | EWCA | 1388 | 658(47.40%) | 211(15.20%) | 299(21.54%) | 173(12.46%) | 1341(96.62%) |
| DIP | CMC | 303 | 1(0.33%) | 8(2.64%) | 58(19.14%) | 77(25.41%) | 144(47.53%) |
| | PEWCC | 676 | 78(11.53%) | 117(17.30%) | 278(41.12%) | 132(19.52%) | 605(89.50%) |
| | GMFTP | 548 | 43(7.84%) | 36(6.56%) | 105(19.16%) | 166(30.29%) | 350(63.69%) |
| | COACH | 329 | 21(6.38%) | 25(7.59%) | 66(20.06%) | 32(9.72%) | 144(43.68%) |
| | ProRank+ | 338 | 74(21.89%) | 77(22.78%) | 126(37.27%) | 42(12.42%) | 319(94.38%) |
| | DPC | 622 | 72(11.57%) | 113(18.16%) | 197(31.67%) | 176(28.29%) | 558(89.72%) |
| | EWCA | 964 | 188(19.50%) | 126(13.07%) | 319(33.09%) | 236(24.48%) | 869(90.15%) |
NOTE: Table 5 lists the number and percentage of protein complexes detected by CMC, PEWCC, GMFTP, COACH, ProRank+, DPC and EWCA in each PPI network whose p-values fall within different ranges. To assess functional enrichment, both the number and the percentage should be taken into account. For example, on the DIP dataset, the entry in the fourth column of the fourteenth row gives 188 × 19.50% = 36.66, which is the highest product in that column and means that EWCA performs best among these methods. From the fourth to the seventh column, the larger the value, the better the functional enrichment
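The worked example in the note (188 × 19.50% = 36.66) can be reproduced for every algorithm in the same column. The sketch below copies the DIP entries of the <E-15 column from the table and ranks the algorithms by the product of count and percentage; the variable names are hypothetical, and treating this product as a ranking score follows the note's example rather than a formally defined metric.

```python
# DIP rows, "<E-15" column: (number of complexes, percentage of all predicted complexes).
dip_lt_e15 = {
    "CMC": (1, 0.33),
    "PEWCC": (78, 11.53),
    "GMFTP": (43, 7.84),
    "COACH": (21, 6.38),
    "ProRank+": (74, 21.89),
    "DPC": (72, 11.57),
    "EWCA": (188, 19.50),
}

# Product of count and percentage, as in the note's worked example.
scores = {
    name: count * percent / 100
    for name, (count, percent) in dip_lt_e15.items()
}

best = max(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:10s} {score:6.2f}")
print("highest product:", best)
```

Combining the two numbers this way rewards a method that finds many highly significant complexes without letting a high percentage on a small prediction set dominate.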
Some examples of complexes with low p-values identified by the EWCA method on different datasets
| Dataset | ID | P-value | Cluster frequency | Gene ontology term |
|---|---|---|---|---|
| BioGRID | 1 | 8.83e-108 | 62 of 66 genes, 93.9% | mRNA splicing, via spliceosome |
| | 2 | 2.68e-106 | 70 of 71 genes, 98.6% | cytoplasmic translation |
| | 3 | 1.09e-80 | 78 of 92 genes, 84.8% | chromatin organization |
| | 4 | 2.11e-72 | 55 of 88 genes, 62.5% | ribosomal large subunit biogenesis |
| | 5 | 2.48e-78 | 83 of 102 genes, 81.4% | ribosome biogenesis |
| DIP | 1 | 4.62e-32 | 14 of 16 genes, 87.5% | mRNA polyadenylation |
| | 2 | 1.54e-31 | 24 of 25 genes, 96.0% | mRNA processing |
| | 3 | 2.96e-25 | 15 of 23 genes, 65.2% | maturation of LSU-rRNA from tricistronic rRNA transcript |
| | 4 | 1.80e-28 | 16 of 18 genes, 88.9% | histone acetylation |
| | 5 | 5.58e-29 | 12 of 13 genes, 92.3% | ATP biosynthetic process |
Ten protein complexes with a cluster frequency of 100% on different datasets
| Datasets | ID | P-value | Cluster frequency | Gene ontology term |
|---|---|---|---|---|
| BioGRID | 1 | 1.76e-75 | 46 of 46 genes, 100.0% | RNA splicing |
| | 2 | 1.42e-43 | 16 of 16 genes, 100.0% | tRNA transcription |
| | 3 | 5.77e-40 | 23 of 23 genes, 100.0% | mRNA transport |
| | 4 | 1.36e-32 | 14 of 14 genes, 100.0% | ergosterol biosynthetic process |
| | 5 | 2.24e-30 | 20 of 20 genes, 100.0% | DNA replication |
| DIP | 1 | 4.68e-26 | 10 of 10 genes, 100.0% | anaphase-promoting complex-dependent catabolic process |
| | 2 | 1.06e-31 | 19 of 19 genes, 100.0% | mRNA splicing, via spliceosome |
| | 3 | 7.37e-27 | 21 of 21 genes, 100.0% | mRNA metabolic process |
| | 4 | 8.64e-24 | 15 of 15 genes, 100.0% | mitochondrial translation |
| | 5 | 2.51e-19 | 10 of 10 genes, 100.0% | ncRNA transcription |
Fig. 3 The effect of ss. Performance of EWCA on protein complex identification for different values of the structural similarity threshold ss, measured by all evaluation metrics with respect to the CYC2008 and NewMIPS standard complex sets. The x-axis denotes the value of structural similarity and the y-axis denotes the evaluation metrics on the DIP dataset. The F-measure is maximised at ss=0.4 for the unweighted DIP dataset
Fig. 4 The effect of ss. Performance of EWCA for different values of the structural similarity threshold ss, measured by all evaluation metrics with respect to the CYC2008 and NewMIPS standard complex sets. The x-axis denotes the value of structural similarity and the y-axis denotes the evaluation metrics on the BioGRID dataset. The F-measure is maximised at ss=0.4 for the unweighted BioGRID dataset
Accuracy and running time of different algorithms on the Human and Yeast datasets using Human complexes and Yeast complexes as standard complexes
| Dataset | Algorithms | PC | F-measure | MMR | CR | Running time (s) |
|---|---|---|---|---|---|---|
| Human | PEWCC | 2930 | 0.3955 (2) | 0.0963 (2) | 0.5155 | 83.05 (2) |
| | COACH | 4484 | 0.2455 | 0.0677 | | 2851 |
| | ProRank+ | 838 | 0.3651 | 0.0687 | 0.2856 | 282.66 |
| | EWCA | 1979 | | | 0.5221 (2) | |
| Yeast | PEWCC | 1353 | 0.3446 (2) | 0.0871 (2) | 0.4946 | 36.58 (2) |
| | COACH | 1547 | 0.2083 | 0.0466 | 0.5520 (2) | 3603.31 |
| | ProRank+ | 513 | 0.2712 | 0.0487 | 0.2816 | 251.54 |
| | EWCA | 924 | | | | |
As the table shows, EWCA obtains the best F-measure, MMR and running time on both datasets. The F-measure results indicate that the protein complexes identified by EWCA are more accurate than those of the compared algorithms, and the running-time results indicate that EWCA is also faster than those algorithms. In short, EWCA is both more accurate and more efficient than these state-of-the-art algorithms, consistent with its higher accuracy in Tables 3 and 4. NOTE: The highest value in each row is shown in bold. Parenthesized numbers mark second-ranked values