Nana Jin1, Deng Wu2, Yonghui Gong2, Xiaoman Bi2, Hong Jiang3, Kongning Li2, Qianghu Wang1.
Abstract
An increasing number of experiments have been designed to detect intracellular and intercellular molecular interactions. Based on these molecular interactions (especially protein interactions), molecular networks have been built for use in several typical applications, such as the discovery of new disease genes and the identification of drug targets and molecular complexes. Because the data are incomplete and contain a considerable number of false-positive interactions, protein interactions from different sources are commonly integrated in network analyses to build a stable molecular network. Although various integration strategies are applied in current studies, the topological properties of the resulting networks, and especially the typical applications built on them, have not been rigorously evaluated. In this paper, systematic analyses were performed to evaluate 11 frequently used methods belonging to two types of integration strategies: empirical and machine learning methods. The topological properties of the networks produced by these integration strategies were found to differ significantly. Moreover, these networks were found to dramatically affect the outcomes of typical applications, such as disease gene prediction, drug target detection, and molecular complex identification. The analysis presented in this paper could provide an important basis for future network-based biological research.
Year: 2014 PMID: 25243127 PMCID: PMC4163410 DOI: 10.1155/2014/296349
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Figure 1. Seven machine learning classifiers constructed by using the gold standard datasets. The gold standard datasets (positive protein-protein interactions in the DIP database and negative protein-protein interactions in the NEGATOME database) were used to construct the seven machine learning classifiers based on the following methods: Naive Bayes, Bayesian Networks, Logistic Regression, Support Vector Machine (SVM), Random Tree, Random Forest, and J48.
Figure 2. ROC curves for seven machine learning integration strategies using 10-fold cross-validation against the gold standard datasets. Each point on a ROC curve corresponds to the sensitivity and specificity obtained at a specific likelihood ratio cut-off. The legend names the curve of each integration strategy, and each strategy is drawn in a different colour. The area under each curve is also presented in the figure. Sensitivity and specificity were calculated during the 10-fold cross-validation.
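As a rough illustration of how the points of a ROC curve and the area under it can be obtained from classifier scores, the following Python sketch sweeps a cut-off over a handful of toy scores. The scores, labels, and function names are hypothetical and are not taken from the study; the sketch also assumes that no two scores are tied.

```python
def roc_points(scores, labels):
    """Return (FP rate, TP rate) points, lowering the cut-off one score at a time."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Visit examples from the highest score to the lowest (assumes no ties).
    for _, label in sorted(zip(scores, labels), key=lambda item: -item[0]):
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical classifier scores and gold-standard labels (1 = positive pair).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
toy_labels = [1, 1, 0, 1, 0, 0]

toy_points = roc_points(toy_scores, toy_labels)
toy_auc = area_under_curve(toy_points)
```

Sensitivity is the TP rate and specificity is one minus the FP rate, so each returned point carries the same information as a (sensitivity, specificity) pair.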
Performance of the classifiers constructed by seven machine learning integration strategies.
| Strategy | ACC | AUC | Precision | Recall | FP rate | TP/FP |
|---|---|---|---|---|---|---|
| Naive Bayes | 0.5391 | 0.62 | 0.524 | 0.539 | 0.518 | 1.041 |
| Bayesian Networks | 0.6325 | 0.736 | 0.683 | 0.632 | 0.418 | 1.512 |
| Logistic Regression | 0.7188 | 0.772 | 0.724 | 0.719 | 0.275 | 2.615 |
| SVM | 0.7144 | 0.723 | | 0.714 | | |
| Random Tree | 0.6568 | 0.648 | 0.656 | 0.657 | 0.35 | 1.877 |
| Random Forest | | | 0.72 | | 0.292 | 2.466 |
| J48 | 0.6808 | 0.671 | 0.681 | 0.681 | 0.323 | 2.108 |
Note: ACC stands for the accuracy of the correctly classified items (after a 10-fold cross-validation). AUC indicates the area under the ROC curve. Precision is the number of true positives divided by the total number of elements labelled as belonging to the positive class. Recall (also referred to as the True Positive Rate) represents the number of true positives divided by the total number of elements that actually belong to the positive class. The FP rate indicates the false positive rate. TP/FP reveals the true positive to the false positive ratio. Bold type indicates the maximum value in the ACC, AUC, Precision, Recall, and TP/FP columns and indicates the minimum value in the FP rate column.
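The definitions in the note can be sketched in Python as follows. The confusion-matrix counts are hypothetical, and the function name is an illustration only. The sketch treats TP/FP as Recall divided by FP rate, which is an inference from the reported numbers rather than something the note states; it reproduces the reported TP/FP values for the rows checked below.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute the metrics named in the note from one binary confusion matrix."""
    recall = tp / (tp + fn)          # true positive rate
    fp_rate = fp / (fp + tn)         # false positive rate
    return {
        "ACC": (tp + tn) / (tp + fp + tn + fn),
        "Precision": tp / (tp + fp),
        "Recall": recall,
        "FP rate": fp_rate,
        "TP/FP": recall / fp_rate,   # assumption: ratio of the two rates
    }


# Hypothetical counts, not taken from the study.
toy_metrics = binary_metrics(tp=70, fp=30, tn=70, fn=30)

# Reported (Recall, FP rate, TP/FP) values copied from the table above.
reported = {
    "Naive Bayes": (0.539, 0.518, 1.041),
    "Bayesian Networks": (0.632, 0.418, 1.512),
    "Logistic Regression": (0.719, 0.275, 2.615),
    "Random Tree": (0.657, 0.35, 1.877),
    "J48": (0.681, 0.323, 2.108),
}
recomputed = {name: round(recall / fp_rate, 3)
              for name, (recall, fp_rate, _) in reported.items()}
```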
Figure 3. Eleven new networks built by empirical and machine learning integration strategies. Eleven new networks constructed by funnel-like empirical and machine learning integration strategies, namely, Union, Intersection, 2-Vote, 3-Vote, Naive Bayes, Bayesian Networks, Logistic Regression, SVM, Random Tree, Random Forest, and J48, from the entire set of data in the IntAct, MINT, HPRD, BIND, and BioGRID databases.
The coverage of each network built by 11 integration strategies.
| Strategy | Number | Coverage |
|---|---|---|
| Union | 145534 | 1 |
| Intersection | 497 | 0.0034 |
| 2-Vote | 40766 | 0.2801 |
| 3-Vote | 12891 | 0.0886 |
| Naive Bayes | 134095 | 0.9214 |
| Bayesian Networks | 140956 | 0.9685 |
| Logistic Regression | 140746 | 0.9671 |
| SVM | 142226 | 0.9773 |
| Random Tree | 114598 | 0.7874 |
| Random Forest | 139082 | 0.9557 |
| J48 | 120541 | 0.8283 |
Note: Number stands for the number of interaction pairs predicted by each integration strategy. Coverage represents the ratio of the number of predicted interaction pairs to the total number of interaction pairs in the five databases.
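The empirical strategies and the coverage measure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the protein names and the three toy databases are hypothetical, and "k-Vote" is assumed to mean "reported by at least k databases", which this section does not define explicitly.

```python
from collections import Counter


def k_vote(databases, k):
    """Keep interaction pairs reported by at least k databases (assumed definition)."""
    votes = Counter()
    for database in databases:
        # Treat (A, B) and (B, A) as the same undirected interaction.
        for pair in {tuple(sorted(p)) for p in database}:
            votes[pair] += 1
    return {pair for pair, count in votes.items() if count >= k}


def coverage(network, union_network):
    """Fraction of all integrated interaction pairs that a network retains."""
    return len(network) / len(union_network)


# Hypothetical toy databases, not real interaction data.
toy_databases = [
    {("A", "B"), ("B", "C"), ("C", "D")},
    {("B", "A"), ("B", "C")},
    {("A", "B"), ("D", "E")},
]

union_net = k_vote(toy_databases, 1)                          # Union
two_vote_net = k_vote(toy_databases, 2)                       # 2-Vote
intersection_net = k_vote(toy_databases, len(toy_databases))  # Intersection

two_vote_coverage = coverage(two_vote_net, union_net)
```

With the numbers in the table, the same ratio gives 40766 / 145534 ≈ 0.2801 for 2-Vote and 497 / 145534 ≈ 0.0034 for Intersection.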
The duplication of seven machine learning networks and all 11 integration networks.
| DT (seven machine learning networks) | Number | Percentage | DT (all 11 integration networks) | Number | Percentage |
|---|---|---|---|---|---|
| 0 | 1808 | 1.24% | 1 | 1683 | 1.16% |
| 1 | 1277 | 0.88% | 2 | 134 | 0.09% |
| 2 | 79 | 0.05% | 3 | 718 | 0.49% |
| 3 | 299 | 0.21% | 4 | 777 | 0.53% |
| 4 | 1410 | 0.97% | 5 | 1234 | 0.85% |
| 5 | 9434 | 6.48% | 6 | 7653 | 5.26% |
| 6 | 41487 | 28.51% | 7 | 32465 | 22.31% |
| 7 | 89740 | 61.66% | 8 | 71478 | 49.11% |
| | | | 9 | 20680 | 14.21% |
| | | | 10 | 8400 | 5.77% |
| | | | 11 | 312 | 0.22% |
Note: DT stands for duplication times, that is, the number of networks in which an interaction appears. Number represents the number of such interactions. Percentage is the ratio of the number of such interactions to the total number of interactions in the original network.
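A duplication-times histogram of this kind can be sketched as below. The toy networks and the function name are hypothetical, and interaction pairs are assumed to be stored in one canonical order. The last lines only check that the counts reported in the table add up to the 145534 interactions of the Union network.

```python
from collections import Counter


def duplication_histogram(reference, networks):
    """For each interaction in `reference`, count how many networks contain it."""
    histogram = Counter()
    for pair in reference:
        histogram[sum(pair in network for network in networks)] += 1
    return dict(histogram)


# Hypothetical toy data: a reference network and three derived networks.
toy_reference = {("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")}
toy_networks = [
    {("A", "B"), ("B", "C")},
    {("A", "B"), ("C", "D")},
    {("A", "B"), ("B", "C"), ("C", "D")},
]
toy_histogram = duplication_histogram(toy_reference, toy_networks)

# Counts copied from the table above (DT -> Number).
ml_counts = {0: 1808, 1: 1277, 2: 79, 3: 299, 4: 1410,
             5: 9434, 6: 41487, 7: 89740}
all_counts = {1: 1683, 2: 134, 3: 718, 4: 777, 5: 1234, 6: 7653,
              7: 32465, 8: 71478, 9: 20680, 10: 8400, 11: 312}
ml_total = sum(ml_counts.values())
all_total = sum(all_counts.values())
```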
The topological properties of the 11 new empirical and machine learning networks.
Empirical strategies: Union, Intersection, 2-Vote, and 3-Vote. Machine learning strategies: Naive Bayes, Bayesian Networks, Logistic Regression, SVM, Random Tree, Random Forest, and J48.
| Property | Union | Intersection | 2-Vote | 3-Vote | Naive Bayes | Bayesian Networks | Logistic Regression | SVM | Random Tree | Random Forest | J48 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Proteins | 14936 | 507 | 9548 | 5558 | 14840 | 14869 | 14890 | 14895 | 14486 | 14860 | 14570 |
| Interactions | 145534 | 497 | 40766 | 12891 | 134095 | 140956 | 140746 | 142226 | 114598 | 139082 | 120541 |
| Diameter | 15 | 6 | | 15 | 15 | 15 | | | | 15 | |
| Degree | | 1.96 | 8.54 | 4.64 | 18.07 | 18.96 | 18.90 | | 15.82 | 18.72 | 16.55 |
| Density | 0.00130 | | 0.00089 | 0.00083 | 0.00122 | | 0.00127 | | 0.00109 | 0.00126 | 0.00114 |
| ASP | 2.9216 | | 4.4487 | 4.7394 | 2.9372 | 2.9245 | 2.9256 | | 3.0447 | 2.9280 | 3.0103 |
| CC | 0.0206 | | 0.0471 | 0.0340 | 0.0161 | | 0.0197 | | 0.0156 | 0.0194 | 0.0170 |
Note: Proteins, Interactions, Diameter, Degree, and Density indicate the number of proteins, the number of interactions, the network diameter, the average degree and the network density, respectively. ASP and CC are the average path length and clustering coefficient, respectively. Bold type indicates the minimum value for average path length and the maximum value for the other topological properties of the empirical and machine learning methods.
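For readers who want to relate the rows to one another, the sketch below computes average degree and density with the standard formulas for a simple undirected graph, 2E/N and 2E/(N(N-1)). That the study used exactly these formulas is an assumption; they do reproduce the reported values for the strategies listed, whose numbers are copied from the tables in this section.

```python
def average_degree(n_nodes, n_edges):
    """Average degree of an undirected graph: every edge adds 2 to the degree sum."""
    return 2 * n_edges / n_nodes


def density(n_nodes, n_edges):
    """Fraction of all possible undirected node pairs that are connected."""
    return 2 * n_edges / (n_nodes * (n_nodes - 1))


# strategy: (proteins, interactions, reported degree, reported density)
reported_topology = {
    "2-Vote": (9548, 40766, 8.54, 0.00089),
    "3-Vote": (5558, 12891, 4.64, 0.00083),
    "Naive Bayes": (14840, 134095, 18.07, 0.00122),
    "Random Forest": (14860, 139082, 18.72, 0.00126),
}

recomputed_topology = {
    name: (round(average_degree(n, e), 2), round(density(n, e), 5))
    for name, (n, e, _, _) in reported_topology.items()
}
```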
Description of the top genes in 11 integration networks from the detection of disease genes based on a phenotype similarity study.
| Strategy | Gene symbol | Official full name |
|---|---|---|
| Union | ATP2B2 | ATPase, Ca++ transporting, plasma membrane 2 |
| Intersection | MYD88 | Myeloid differentiation primary response 88 |
| 2-Vote | TGFBR2 | Transforming growth factor, beta receptor II (70/80 kDa) |
| 3-Vote | TGFBR2 | Transforming growth factor, beta receptor II (70/80 kDa) |
| SVM | ATP2B2 | ATPase, Ca++ transporting, plasma membrane 2 |
| Naive Bayes | ATP2B2 | ATPase, Ca++ transporting, plasma membrane 2 |
| Random Tree | GRM7 | Glutamate receptor, metabotropic 7 |
| J48 | ATP2B2 | ATPase, Ca++ transporting, plasma membrane 2 |
| Logistic Regression | ATP2B2 | ATPase, Ca++ transporting, plasma membrane 2 |
| Random Forest | ATP2B2 | ATPase, Ca++ transporting, plasma membrane 2 |
| Bayesian Networks | ATP2B2 | ATPase, Ca++ transporting, plasma membrane 2 |
The performance of the detection of disease genes using RWR.
| Strategy | Rank | Nodes | Rank ratio |
|---|---|---|---|
| Naive Bayes | 2065.86 | 14840 | 0.1392 |
| Logistic Regression | 2113.31 | 14890 | 0.1419 |
| SVM | 2133.86 | 14895 | 0.1433 |
| Union | 2146.21 | 14936 | 0.1437 |
| Random Forest | 2136.90 | 14860 | 0.1438 |
| Bayesian Networks | 2139.83 | 14869 | 0.1439 |
| J48 | 2370.10 | 14570 | 0.1627 |
| Random Tree | 2697.93 | 14486 | 0.1862 |
| 3-Vote | 1306.61 | 5558 | 0.2351 |
| 2-Vote | 2245.89 | 9548 | 0.2352 |
| Intersection | 123.5 | 507 | 0.2436 |
Note: Rank indicates the average rank of the nonseed genes over several repeated experiments; the number of repetitions depended on the number of remaining genes. The rank ratio is the average rank divided by the total number of nodes in each network and was used to evaluate the performance of each integration strategy. The smaller the rank ratio, the better the integration strategy.
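The rank ratio is a simple normalisation, shown here as a short Python check against a few rows copied from the table above. The function and variable names are illustrative only.

```python
def rank_ratio(average_rank, n_nodes):
    """Average rank of the nonseed genes divided by the network size."""
    return average_rank / n_nodes


# strategy: (average rank, nodes, reported rank ratio), copied from the table.
reported_ranks = {
    "Naive Bayes": (2065.86, 14840, 0.1392),
    "Union": (2146.21, 14936, 0.1437),
    "Random Tree": (2697.93, 14486, 0.1862),
    "3-Vote": (1306.61, 5558, 0.2351),
    "Intersection": (123.5, 507, 0.2436),
}

recomputed_ratios = {name: round(rank_ratio(rank, nodes), 4)
                     for name, (rank, nodes, _) in reported_ranks.items()}
```

Dividing by the number of nodes is what makes small networks such as Intersection comparable with large ones: its average rank of 123.5 looks low in absolute terms, but it corresponds to the largest rank ratio in the table.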
Figure 4. The performance of detecting disease genes using RWR. A box plot shows the rank differences among the 11 integration strategies. Apparent distinctions exist between the networks built by the different integration strategies.
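For orientation, a minimal random walk with restart (RWR) can be sketched in plain Python as below. This is a generic textbook formulation, not the authors' implementation: the restart probability of 0.7, the toy path graph, and all names are assumptions, and the sketch assumes every node has at least one neighbour.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """Iterate p <- restart * p0 + (1 - restart) * W p until p stops changing.

    `adjacency` maps each node to the set of its neighbours (undirected graph);
    W spreads a node's probability evenly over its neighbours.
    """
    nodes = sorted(adjacency)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            share = (1.0 - restart) * p[u] / len(adjacency[u])
            for v in adjacency[u]:
                nxt[v] += share
        change = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if change < tol:
            break
    return p


# Hypothetical toy network: a path A - B - C - D - E with seed gene A.
toy_graph = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}
toy_seeds = {"A"}
toy_scores = random_walk_with_restart(toy_graph, toy_seeds)

# Rank the nonseed genes by decreasing steady-state probability.
toy_ranking = sorted((n for n in toy_graph if n not in toy_seeds),
                     key=lambda n: -toy_scores[n])
```

In this toy graph the candidates closest to the seed receive the highest scores, which is the intuition behind ranking nonseed genes with RWR.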
The top 10 nonseed genes in each of the 11 integration networks in the detection of disease genes using RWR.
| Strategy | The symbols of the top 10 genes |
|---|---|
| Union | UBC, TAF1, MYC, HNF4A, SMARCA4, ELAVL1, CDK2, FASLG, XRCC6, and SDHA |
| Intersection | YWHAB, RAD50, CTNNB1, GRB2, SHC1, ABL1, YWHAZ, YWHAE, ERBB2, and RB1 |
| 2-Vote | MLH1, PTPN6, XRCC6, EXO1, ARHGDIA, VAV3, HRAS, FASLG, APP, and TNIK |
| 3-Vote | PTPN6, MAX, ZHX1, CCDC90B, MLH1, EXO1, IMMT, VIM, ASF1B, and ASF1A |
| SVM | UBC, TAF1, MYC, HNF4A, SMARCA4, ELAVL1, CDK2, FASLG, XRCC6, and SDHA |
| Naive Bayes | UBC, TAF1, MYC, HNF4A, SMARCA4, ELAVL1, CDK2, XRCC6, FASLG, and SDHA |
| Random Tree | UBC, MYC, XRCC6, SMARCA4, ARHGDIA, TAF1, ABL1, ELAVL1, FASLG, and CDK2 |
| J48 | UBC, TAF1, MYC, HNF4A, SMARCA4, XRCC6, ELAVL1, FASLG, CDK2, and DTNBP1 |
| Logistic Regression | UBC, TAF1, MYC, HNF4A, SMARCA4, XRCC6, ELAVL1, CDK2, FASLG, and ARHGDIA |
| Random Forest | UBC, TAF1, MYC, HNF4A, SMARCA4, ELAVL1, CDK2, FASLG, XRCC6, and SDHA |
| Bayesian Networks | UBC, TAF1, MYC, HNF4A, SMARCA4, ELAVL1, CDK2, FASLG, XRCC6, and SDHA |
Note: descriptions of these genes are listed in Supplementary Table S1 of the Supplementary Material, available online at http://dx.doi.org/10.1155/2014/296349.
Performance of each network built by integration strategies for the discovery of drug targets based on topological properties.
| Strategy | 1N | Target | Node | Target ratio |
|---|---|---|---|---|
| Union | 82 | | 14936 | 0.132833 |
| Intersection | 40 | 133 | 507 | |
| 2-Vote | 79 | 1464 | 9548 | 0.153331 |
| 3-Vote | 55 | 885 | 5558 | 0.15923 |
| SVM | 83 | 1974 | 14895 | 0.132528 |
| Naive Bayes | 83 | 1969 | 14840 | 0.132682 |
| Random Tree | 81 | 1941 | 14486 | 0.133991 |
| J48 | 81 | 1945 | 14570 | 0.133493 |
| Logistic Regression | | 1981 | 14890 | 0.133042 |
| Random Forest | 83 | 1972 | 14860 | 0.132705 |
| Bayesian Networks | 83 | 1973 | 14869 | 0.132692 |
Note: 1N indicates the number of targets included in the top 100 proteins. Target indicates the number of targets in the network. The target ratio reveals the percentage of targets in a network. The bold type indicates the maximum values in the 1N, Target, Node, and Target ratio columns.
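The two quantities in the note can be sketched as follows. This section does not state which topological property was used to rank proteins, so ranking by node degree is an assumption made purely for illustration; the toy network, the target set, and the function names are hypothetical. The reported target ratios checked at the end are copied from the table above.

```python
def targets_in_top_k(adjacency, targets, k):
    """Count known targets among the k highest-degree proteins (assumed ranking)."""
    # Sort by decreasing degree; break ties alphabetically for reproducibility.
    ranked = sorted(adjacency, key=lambda n: (-len(adjacency[n]), n))
    return sum(1 for protein in ranked[:k] if protein in targets)


def target_ratio(n_targets, n_nodes):
    """Fraction of the proteins in a network that are known targets."""
    return n_targets / n_nodes


# Hypothetical toy network and target set, not real data.
toy_network = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
toy_targets = {"A", "D"}
toy_top2 = targets_in_top_k(toy_network, toy_targets, k=2)

# strategy: (targets, nodes, reported target ratio), copied from the table.
reported_targets = {
    "2-Vote": (1464, 9548, 0.153331),
    "3-Vote": (885, 5558, 0.15923),
    "SVM": (1974, 14895, 0.132528),
}
recomputed_target_ratios = {name: round(target_ratio(t, n), 6)
                            for name, (t, n, _) in reported_targets.items()}
```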
The duplication of targets in the top 100 in each network built by all 11 integration strategies.
| DT | Number | Percentage |
|---|---|---|
| 1 | 58 | 0.3391 |
| 2 | 21 | 0.1228 |
| 3 | 5 | 0.0292 |
| 4 | 4 | 0.0234 |
| 5 | 6 | 0.0351 |
| 6 | 3 | 0.0175 |
| 7 | 6 | 0.0351 |
| 8 | 27 | 0.1579 |
| 9 | 21 | 0.1228 |
| 10 | 11 | 0.0643 |
| 11 | 9 | 0.0526 |
Note: DT indicates the duplication times of the targets that appear in the top 100 of each network. Number represents the number of targets. Percentage reveals the ratio of the number of targets to the total of all of the targets that appear in the top 100 of each network.
The topological properties of the molecular complexes found in the 11 networks built by the integration strategies, based on the MCODE clustering algorithm.
Empirical strategies: Union, Intersection, 2-Vote, and 3-Vote. Machine learning strategies: Naive Bayes, Bayesian Networks, Logistic Regression, SVM, Random Tree, Random Forest, and J48.
| Property | Union | Intersection | 2-Vote | 3-Vote | Naive Bayes | Bayesian Networks | Logistic Regression | SVM | Random Tree | Random Forest | J48 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Proteins | | 5 | 26 | 29 | 55 | | 64 | 61 | 40 | 59 | 48 |
| Interactions | | 9 | 169 | 92 | 438 | | 1761 | 1628 | 702 | 1497 | 1028 |
| Diameter | 2 | 2 | 2 | | | 2 | 2 | 2 | 2 | 2 | 2 |
| Degree | | 3.6 | 12.769 | 5.241 | 15.927 | | 54.844 | 53.377 | 35.1 | 50.746 | 42.833 |
| Density | | 0.9 | 0.511 | 0.187 | 0.295 | 0.859 | 0.871 | 0.890 | 0.9 | 0.875 | |
| ASP | 1.127 | | 1.489 | 3.264 | 2.773 | 1.141 | 1.129 | 1.110 | 1.1 | 1.125 | |
| CC | 0.903 | 0.9 | | 0.816 | 0.851 | 0.894 | 0.899 | 0.913 | 0.904 | 0.895 | |
Note: Proteins, Interactions, Diameter, Degree, and Density indicate the number of proteins, the number of interactions, the network diameter, the average degree, and the network density, respectively. ASP and CC are the average path length and clustering coefficient, respectively. Bold type indicates the minimum value for average path length and the maximum value for the other topological properties of the empirical and machine learning methods.
Gene symbol and degree of the proteins that have the largest degree in every molecular complex of each network.
| Strategies | Gene symbol | Degree |
|---|---|---|
| Union | RPL5 and UBC | 64 |
| Intersection | IRAK1, IRAK2, and IRAK3 | 4 |
| 2-Vote | UCHL5 | 25 |
| 3-Vote | IKBKG | 10 |
| SVM | RPS8, RPS2, RPL5, RPL11, RPL18, RPS16, RPS6, RPL19, RPS13, RPL21, RPL6, RPL10A, UBC, RPS4X, RPL4, and RPS3 | 60 |
| Naive Bayes | MED26 and MED29 | 27 |
| Random Tree | RPL5, UBC, and RPL4 | 39 |
| J48 | RPS2, RPL5, RPL11, RPS6, RPL19, RPL21, RPL6, RPL10A, UBC, RPS4X, RPL14, and RPL4 | 47 |
| Logistic Regression | RPL5 | 65 |
| Random Forest | RPL11, RPS6, RPL14, and RPL4 | 58 |
| Bayesian Networks | RPL18, RPS16, RPS6, RPS4X, RPS8, RPS2, RPL5, RPL21, UBC, and RPL4 | 64 |