Xiaoxia Liu, Zhihao Yang, Shengtian Sang, Ziwei Zhou, Lei Wang, Yin Zhang, Hongfei Lin, Jian Wang, Bo Xu.
Abstract
BACKGROUND: Protein complexes are one of the keys to deciphering the behavior of a cell system. During the past decade, most computational approaches used to identify protein complexes have been based on discovering densely connected subgraphs in protein-protein interaction (PPI) networks. However, many true complexes are not dense subgraphs, so these approaches show limited performance in detecting protein complexes from PPI networks.
Keywords: Node embeddings; Protein complex detection; Random forest; Supervised learning method
Year: 2018 PMID: 30241459 PMCID: PMC6150962 DOI: 10.1186/s12859-018-2364-2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 The overall workflow of the NodeEmbed-SLPC-RF method. a P1, P2, P3, P4, P5 and P6 are the proteins in the PPI network, and P1, P5 and P6 compose a protein complex. b The red node in the left network is the seed node, and the nodes in slashed circles in the right network form a candidate protein complex discovered by the SLPC model
Performance comparison results on HPRD and DIP datasets
| Methods | No. of complexes | Precision | Recall | F-score |
|---|---|---|---|---|
| HPRD | ||||
| ClusterONE | 789 | 0.2307 | 0.1724 | 0.1973 |
| MCODE | 102 | 0.2059 | 0.0258 | 0.0458 |
| MCL | 1291 | 0.1255 | 0.1704 | 0.1445 |
| CMC | 44 | 0.3636 | 0.0178 | 0.0340 |
| Coach | 1762 | 0.2469 | 0.3890 | 0.3021 |
| ProRank+ | 500 | 0.2820 | 0.1625 | 0.2062 |
| PEWCC | 1194 | 0.2739 | 0.2299 | 0.2499 |
| SLPC only | 2713 | 0.3693 | 0.4901 | 0.4212 |
| d=32 | 858 | 0.7005 | 0.3785 | 0.4914 |
| d=64 | 871 | 0.7107 | 0.3983 | |
| d=128 | 841 | 0.7099 | 0.3890 | 0.5026 |
| d=256 | 882 | 0.6961 | 0.3877 | 0.4980 |
| d=512 | 823 | 0.7096 | 0.3831 | 0.4976 |
| d=1024 | 867 | 0.7105 | 0.3970 | 0.5093 |
| DIP | ||||
| ClusterONE | 363 | 0.5069 | 0.4012 | 0.4479 |
| MCODE | 82 | 0.0244 | 0.0030 | 0.0053 |
| MCL | 436 | 0.3463 | 0.3952 | 0.3692 |
| CMC | 262 | 0.4389 | 0.2912 | 0.3501 |
| Coach | 747 | 0.4351 | 0.5156 | 0.4719 |
| ProRank+ | 167 | 0.4731 | 0.1516 | 0.2296 |
| PEWCC | 666 | 0.5916 | 0.3744 | 0.4586 |
| SLPC only | 1061 | 0.6447 | 0.4829 | 0.5522 |
| d=32 | 719 | 0.8108 | 0.4428 | 0.5728 |
| d=64 | 710 | 0.8070 | 0.4473 | |
| d=128 | 702 | 0.8148 | 0.4368 | 0.5688 |
| d=256 | 708 | 0.8263 | 0.4413 | 0.5753 |
| d=512 | 711 | 0.8158 | 0.4413 | 0.5728 |
| d=1024 | 691 | 0.8249 | 0.4354 | 0.5699 |
d denotes the dimension of each vector. No. of complexes denotes the total number of complexes predicted by each method. Bold value denotes the best F-score
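The F-score column can be checked against the precision and recall columns. The sketch below assumes the F-score is the usual harmonic mean of precision and recall. That definition is not stated in this section, but it matches the tabulated values to within rounding. The function name `f_score` is illustrative only.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed definition of F-score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the HPRD rows for ClusterONE and SLPC only.
clusterone_f = f_score(0.2307, 0.1724)
slpc_only_f = f_score(0.3693, 0.4901)
print(round(clusterone_f, 4), round(slpc_only_f, 4))
```

Because the published precision and recall are themselves rounded, a recomputed F-score can differ from the table in the last digit.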
Performance comparison results on DIP datasets using the MIPS gold standard
| Methods | Precision | Recall | F-score |
|---|---|---|---|
| Ours | 0.893 | 0.581 | |
| SLPC only | 0.419 | 0.670 | 0.514 |
| ClusterEPs | 0.649 | 0.751 | 0.695 |
| SCI-BN | 0.273 | 0.473 | 0.346 |
| NN | 0.333 | 0.491 | 0.397 |
Bold value denotes the best score corresponding to F-score. Ours denotes the NodeEmbed-SLPC-RF method
Fig. 2 The performance comparison in terms of F-score obtained by SVM, LR and RF with different dimensions on a HPRD and b DIP
Fig. 3 The numbers of edges left after filtering by using different simi-thres on HPRD and DIP. a HPRD. b DIP
Experimental results obtained by using RF to filter the candidate complexes which are predicted from the modified HPRD network by filtering edges with different simi-thres
| Simi-thres | No. of edges left | No. of complexes | Precision | Recall | F-score | Δ |
|---|---|---|---|---|---|---|
| 0.80 | 36164 | 999 | 0.6617 | 0.4181 | 0.5124 | +0.0912 |
| 0.81 | 36009 | 999 | 0.6547 | 0.4306 | 0.5195 | +0.0983 |
| 0.82 | 35869 | 1019 | 0.6487 | 0.4280 | 0.5157 | +0.0945 |
| 0.83 | 35710 | 1006 | 0.6531 | 0.4293 | 0.5181 | +0.0969 |
| 0.84 | 35523 | 999 | 0.6607 | 0.4326 | 0.5229 | +0.1017 |
| 0.85 | 35311 | 992 | 0.6552 | 0.4359 | 0.5235 | +0.1023 |
| 0.86 | 35117 | 992 | 0.6673 | 0.4326 | | |
| 0.87 | 34887 | 979 | 0.6599 | 0.4313 | 0.5216 | +0.1004 |
| 0.88 | 34621 | 975 | 0.6728 | 0.4221 | 0.5187 | +0.0975 |
| 0.89 | 34278 | 950 | 0.6505 | 0.4207 | 0.5110 | +0.0898 |
| 0.90 | 33921 | 943 | 0.6585 | 0.4221 | 0.5144 | +0.0932 |
Δ denotes the improvement in F-score compared with using SLPC alone. Bold values denote the best scores corresponding to the specific metric
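The Δ column can be reproduced from the F-score column and the "SLPC only" F-scores reported in the first comparison table (0.4212 on HPRD, 0.5522 on DIP). The following is a minimal sketch. The names `SLPC_ONLY_F` and `delta` are illustrative and not taken from the paper.

```python
# Baseline F-scores copied from the "SLPC only" rows of the first comparison table.
SLPC_ONLY_F = {"HPRD": 0.4212, "DIP": 0.5522}


def delta(f_score: float, dataset: str) -> float:
    """Improvement in F-score over using SLPC alone on the given dataset."""
    return round(f_score - SLPC_ONLY_F[dataset], 4)


# HPRD, filtering simi-thres 0.80: F-score 0.5124
print(f"{delta(0.5124, 'HPRD'):+.4f}")  # +0.0912
# DIP, filtering simi-thres 0.65: F-score 0.5869
print(f"{delta(0.5869, 'DIP'):+.4f}")   # +0.0347
```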
Experimental results obtained by using RF to filter the candidate complexes which are predicted from the modified DIP network by filtering edges with different simi-thres
| Simi-thres | No. of edges left | No. of complexes | Precision | Recall | F-score | Δ |
|---|---|---|---|---|---|---|
| 0.65 | 12167 | 653 | 0.8760 | 0.4413 | 0.5869 | +0.0347 |
| 0.66 | 11941 | 667 | 0.8726 | 0.4428 | | |
| 0.67 | 11683 | 652 | 0.8712 | 0.4368 | 0.5819 | +0.0297 |
| 0.68 | 11423 | 634 | 0.8801 | 0.4324 | 0.5799 | +0.0277 |
| 0.69 | 11174 | 617 | 0.8995 | 0.4294 | 0.5813 | +0.0291 |
| 0.70 | 10946 | 612 | 0.9020 | 0.4235 | 0.5764 | +0.0242 |
| 0.71 | 10673 | 610 | 0.8918 | 0.4235 | 0.5743 | +0.0221 |
| 0.72 | 10410 | 616 | 0.8929 | 0.4235 | 0.5745 | +0.0223 |
| 0.73 | 10184 | 622 | 0.8939 | 0.4264 | 0.5774 | +0.0252 |
| 0.74 | 9907 | 608 | 0.8947 | 0.4160 | 0.5680 | +0.0158 |
| 0.75 | 9633 | 594 | 0.9091 | 0.4190 | 0.5736 | +0.0214 |
Δ denotes the improvement in F-score compared with using SLPC alone. Bold values denote the best scores corresponding to the specific metric
Fig. 4 The numbers of edges added by using different simi-thres on HPRD and DIP. a HPRD. b DIP
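The edge filtering and edge adding steps behind Figs. 3 and 4 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation. It assumes that similarity is the cosine similarity between node embeddings (the measure is not named in this section), that edges below simi-thres are removed, and that unconnected pairs at or above simi-thres are added. The direction of both rules is consistent with the tables, where the number of edges left and the number of edges added both fall as simi-thres rises. All function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two embedding vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_edges(edges, emb, simi_thres):
    """Keep only edges whose endpoint similarity is at least simi_thres."""
    return {(u, v) for u, v in edges if cosine(emb[u], emb[v]) >= simi_thres}


def add_edges(nodes, edges, emb, simi_thres):
    """Return the existing edges plus every unconnected pair with similarity >= simi_thres."""
    existing = {frozenset(e) for e in edges}
    added = {
        (u, v)
        for u, v in combinations(sorted(nodes), 2)
        if frozenset((u, v)) not in existing
        and cosine(emb[u], emb[v]) >= simi_thres
    }
    return set(edges) | added


# Toy embeddings for three proteins (made-up values, for illustration only).
emb = {"P1": [1.0, 0.0], "P2": [0.9, 0.1], "P3": [0.0, 1.0]}
print(filter_edges({("P1", "P2"), ("P1", "P3")}, emb, 0.5))
print(add_edges(emb.keys(), {("P1", "P3")}, emb, 0.5))
```

Enumerating all node pairs is quadratic in the number of proteins. That is acceptable for a sketch, but a network of the size used here would probably need a more efficient neighbour search.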
Experimental results obtained by using RF to filter the candidate complexes which are predicted from the modified HPRD network by adding edges with different simi-thres
| Simi-thres | No. of added edges | No. of complexes | Precision | Recall | F-score | Δ |
|---|---|---|---|---|---|---|
| 0.65 | 7889 | 717 | 0.6137 | 0.2893 | 0.3932 | -0.0280 |
| 0.66 | 7572 | 824 | 0.6104 | 0.3454 | 0.4412 | +0.0200 |
| 0.67 | 7174 | 829 | 0.6164 | 0.3487 | 0.4455 | +0.0243 |
| 0.68 | 6531 | 940 | 0.6266 | 0.3983 | 0.4870 | +0.0658 |
| 0.69 | 5546 | 952 | 0.6313 | 0.4003 | 0.4899 | +0.0687 |
| 0.70 | 4121 | 1030 | 0.6544 | 0.4168 | 0.5092 | +0.0880 |
| 0.71 | 2522 | 1021 | 0.6513 | 0.4148 | 0.5068 | +0.0856 |
| 0.72 | 1390 | 1028 | 0.6566 | 0.4207 | | |
| 0.73 | 850 | 1015 | 0.6611 | 0.4155 | 0.5102 | +0.0890 |
| 0.74 | 583 | 1024 | 0.6563 | 0.4188 | 0.5113 | +0.0901 |
| 0.75 | 447 | 1017 | 0.6608 | 0.4168 | 0.5111 | +0.0899 |
Δ denotes the improvement in F-score compared with using SLPC alone. Bold values denote the best scores corresponding to the specific metric
Experimental results obtained by using RF to filter the candidate complexes which are predicted from the modified DIP network by adding edges with different simi-thres
| Simi-thres | No. of added edges | No. of complexes | Precision | Recall | F-score | Δ |
|---|---|---|---|---|---|---|
| 0.35 | 3351 | 702 | 0.8305 | 0.4428 | 0.5776 | +0.0254 |
| 0.36 | 3153 | 707 | 0.8317 | 0.4458 | 0.5804 | +0.0282 |
| 0.37 | 2979 | 698 | 0.8295 | 0.4398 | 0.5748 | +0.0226 |
| 0.38 | 2784 | 696 | 0.8290 | 0.4473 | | |
| 0.39 | 2586 | 691 | 0.8234 | 0.4413 | 0.5746 | +0.0224 |
| 0.40 | 2378 | 677 | 0.8198 | 0.4339 | 0.5674 | +0.0152 |
| 0.41 | 2196 | 685 | 0.8161 | 0.4354 | 0.5678 | +0.0156 |
| 0.42 | 2019 | 698 | 0.8095 | 0.4383 | 0.5687 | +0.0165 |
| 0.43 | 1831 | 689 | 0.8084 | 0.4339 | 0.5647 | +0.0125 |
| 0.44 | 1634 | 703 | 0.8108 | 0.4413 | 0.5715 | +0.0193 |
| 0.45 | 1473 | 710 | 0.8056 | 0.4413 | 0.5702 | +0.0180 |
Δ denotes the improvement in F-score compared with using SLPC alone. Bold values denote the best scores corresponding to the specific metric
Comparison results for link prediction on HPRD and DIP
| Method | Mean ranking | Hits@1 | Hits@10 | Hits@50 |
|---|---|---|---|---|
| HPRD | ||||
| random | 52.87 | 2 | 7.8 | 47.8 |
| node2vec | | | 53.4 | |
| PE | 35.53 | 25.64 | 52.14 | 70.09 |
| AdjustCD | 35.07 | 23.93 | | 68.38 |
| DIP | ||||
| random | 49.01 | 2.8 | 10.8 | 51.4 |
| node2vec | | | | |
| PE | 30.73 | 3.8 | 29.8 | 75.4 |
| AdjustCD | 29.03 | 8.8 | 37.8 | 77.4 |
Bold values denote the best scores corresponding to the specific metric. Each Hits@N column gives the percentage of true edges ranked in the top N
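The two link prediction metrics can be sketched as below. The sketch assumes each held-out true edge has already been assigned a 1-based rank among its candidate edges. The ranking protocol itself is not described in this section, and the function names are illustrative.

```python
def mean_rank(ranks):
    """Average 1-based rank of the true edges (lower is better)."""
    return sum(ranks) / len(ranks)


def hits_at(ranks, n):
    """Percentage of true edges whose rank is within the top n (higher is better)."""
    return 100.0 * sum(1 for r in ranks if r <= n) / len(ranks)


# Made-up ranks for five true edges, for illustration only.
ranks = [1, 3, 12, 60, 7]
print(mean_rank(ranks))    # 16.6
print(hits_at(ranks, 1))   # 20.0
print(hits_at(ranks, 10))  # 60.0
print(hits_at(ranks, 50))  # 80.0
```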
Comparison results for link prediction with different dimensions by using node2vec on HPRD and DIP
| Dimension | Mean ranking | Hits@1 | Hits@10 | Hits@50 |
|---|---|---|---|---|
| HPRD | ||||
| d=32 | 25.37 | 28.6 | 51.2 | 76.4 |
| d=64 | | | | |
| d=128 | 25.83 | 27.6 | 52.8 | 76.2 |
| d=256 | 27.62 | 26.6 | 49 | 74 |
| d=512 | 27.74 | 27.8 | 47.2 | 75.4 |
| d=1024 | 27.22 | 25.8 | 50 | 74.2 |
| DIP | ||||
| d=32 | 12.76 | 54 | 74.4 | 89.2 |
| d=64 | | | | 91.4 |
| d=128 | 11.45 | 59 | 80.2 | 90 |
| d=256 | 10.77 | 57 | 79 | |
| d=512 | 11.19 | 54.4 | 79 | 90.6 |
| d=1024 | 10.65 | 52 | 77 | 91.2 |
Bold values denote the best scores corresponding to the specific metric. Each Hits@N column gives the percentage of true edges ranked in the top N
Performance comparison using different vector generation strategies on HPRD and DIP datasets
| Methods | No. of complexes | Precision | Recall | F-score |
|---|---|---|---|---|
| HPRD | ||||
| Max | 871 | 0.7107 | 0.3983 | |
| Min | 854 | 0.7037 | 0.3824 | 0.4956 |
| Average | 937 | 0.6126 | 0.354 | 0.4487 |
| DIP | ||||
| Max | 710 | 0.8070 | 0.4473 | |
| Min | 701 | 0.8160 | 0.4368 | 0.5690 |
| Average | 698 | 0.8181 | 0.4354 | 0.5683 |
Bold value denotes the best F-score. Max denotes taking the maximum of each column of the matrix Z, which is composed of the embeddings of the nodes in the complex. Min denotes taking the minimum of each column of Z. Average denotes taking the mean of each column of Z
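The three vector generation strategies amount to column-wise pooling over Z. The sketch below assumes Z is stored with one row per protein in the complex and one column per embedding dimension. The function name `pool_columns` and the toy values are illustrative only.

```python
def pool_columns(Z, strategy="max"):
    """Collapse a complex's embedding matrix Z (rows = proteins) into one vector."""
    reducers = {
        "max": max,
        "min": min,
        "average": lambda col: sum(col) / len(col),
    }
    reduce_col = reducers[strategy]
    # zip(*Z) iterates over the columns of Z, one embedding dimension at a time.
    return [reduce_col(col) for col in zip(*Z)]


# Three proteins with two-dimensional embeddings (made-up values).
Z = [[1.0, 4.0], [3.0, 2.0], [2.0, 0.0]]
print(pool_columns(Z, "max"))      # [3.0, 4.0]
print(pool_columns(Z, "min"))      # [1.0, 0.0]
print(pool_columns(Z, "average"))  # [2.0, 2.0]
```

Whichever strategy is used, the output has the same length as one node embedding, so complexes of different sizes map to fixed-length feature vectors.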
Experimental results obtained by using RF to filter the candidate complexes which are predicted from the modified HPRD network by filtering edges first and then adding edges with different simi-thres
| Simi-thres | No. of complexes | Precision | Recall | F-score | Δ |
|---|---|---|---|---|---|
| fixing filtering simi-thres to 0.86 | |||||
| 0.86_0.65 | 1018 | 0.6234 | 0.3547 | 0.4521 | +0.0309 |
| 0.86_0.66 | 1137 | 0.5638 | 0.4075 | 0.4731 | +0.0519 |
| 0.86_0.68 | 1151 | 0.5656 | 0.4095 | 0.4751 | +0.0539 |
| 0.86_0.67 | 874 | 0.6545 | 0.4135 | 0.5068 | +0.0856 |
| 0.86_0.69 | 868 | 0.6544 | 0.4135 | 0.5068 | +0.0856 |
| 0.86_0.70 | 872 | 0.6514 | 0.4148 | 0.5068 | +0.0856 |
| 0.86_0.71 | 872 | 0.6560 | 0.4135 | 0.5072 | +0.0860 |
| 0.86_0.72 | 952 | 0.6702 | 0.4293 | 0.5234 | +0.1022 |
| 0.86_0.73 | 967 | 0.6660 | 0.4267 | 0.5201 | +0.0989 |
| 0.86_0.74 | 981 | 0.6758 | 0.4293 | | |
| 0.86_0.75 | 978 | 0.6708 | 0.4300 | 0.5240 | +0.1028 |
| fixing adding simi-thres to 0.72 | |||||
| 0.80_0.72 | 903 | 0.6755 | 0.4062 | 0.5073 | +0.0861 |
| 0.81_0.72 | 897 | 0.6778 | 0.4188 | 0.5177 | +0.0965 |
| 0.82_0.72 | 975 | 0.5908 | 0.3791 | 0.4619 | +0.0407 |
| 0.83_0.72 | 905 | 0.6862 | 0.4221 | 0.5226 | +0.1014 |
| 0.84_0.72 | 888 | 0.6926 | 0.4194 | 0.5224 | +0.1012 |
| 0.85_0.72 | 907 | 0.6880 | 0.4221 | 0.5232 | +0.1020 |
| 0.86_0.72 | 952 | 0.6702 | 0.4293 | 0.5234 | +0.1022 |
| 0.87_0.72 | 871 | 0.6820 | 0.4161 | 0.5169 | +0.0957 |
| 0.88_0.72 | 890 | 0.6685 | 0.4102 | 0.5084 | +0.0872 |
| 0.89_0.72 | 853 | 0.6694 | 0.4089 | 0.5076 | +0.0864 |
| 0.90_0.72 | 856 | 0.6600 | 0.4055 | 0.5024 | +0.0812 |
Δ denotes the improvement in F-score compared with using SLPC alone. Bold values denote the best scores corresponding to the specific metric
Experimental results obtained by using RF to filter the candidate complexes which are predicted from the modified DIP network by filtering edges with simi-thres 0.66 first and then adding edges with different simi-thres
| Simi-thres | No. of complexes | Precision | Recall | F-score | Δ |
|---|---|---|---|---|---|
| fixing filtering simi-thres to 0.66 | |||||
| 0.66_0.35 | 665 | 0.8797 | 0.4428 | 0.5891 | +0.0369 |
| 0.66_0.36 | 659 | 0.8786 | 0.4398 | 0.5862 | +0.0340 |
| 0.66_0.37 | 667 | 0.8741 | 0.4398 | 0.5852 | +0.0330 |
| 0.66_0.38 | 660 | 0.8758 | 0.4443 | 0.5895 | +0.0373 |
| 0.66_0.39 | 673 | 0.8678 | 0.4413 | 0.5851 | +0.0329 |
| 0.66_0.40 | 669 | 0.8789 | 0.4443 | | |
| 0.66_0.41 | 669 | 0.8714 | 0.4428 | 0.5872 | +0.0350 |
| 0.66_0.42 | 667 | 0.8786 | 0.4428 | 0.5888 | +0.0366 |
| 0.66_0.43 | 672 | 0.8705 | 0.4398 | 0.5844 | +0.0322 |
| 0.66_0.44 | 667 | 0.8741 | 0.4413 | 0.5865 | +0.0343 |
| 0.66_0.45 | 667 | 0.8771 | 0.4398 | 0.5859 | +0.0337 |
| fixing adding simi-thres to 0.38 | |||||
| 0.65_0.38 | 594 | 0.8064 | 0.4086 | 0.5424 | -0.0098 |
| 0.66_0.38 | 660 | 0.8758 | 0.4443 | 0.5895 | +0.0373 |
| 0.67_0.38 | 707 | 0.7765 | 0.4250 | 0.5493 | -0.0029 |
| 0.68_0.38 | 681 | 0.7797 | 0.4190 | 0.5451 | -0.0071 |
| 0.69_0.38 | 687 | 0.7729 | 0.4160 | 0.5409 | -0.0113 |
| 0.70_0.38 | 676 | 0.7678 | 0.4086 | 0.5334 | -0.0188 |
| 0.71_0.38 | 664 | 0.7636 | 0.4071 | 0.5311 | -0.0211 |
| 0.72_0.38 | 678 | 0.7478 | 0.4071 | 0.5272 | -0.0250 |
| 0.73_0.38 | 677 | 0.7518 | 0.4042 | 0.5257 | -0.0265 |
| 0.74_0.38 | 678 | 0.7552 | 0.4012 | 0.5240 | -0.0282 |
| 0.75_0.38 | 655 | 0.7588 | 0.3923 | 0.5172 | -0.0350 |
Δ denotes the improvement in F-score compared with using SLPC alone. Bold values denote the best scores corresponding to the specific metric