Kushal Veer Singh, Lovekesh Vig.
Abstract
Interactomes such as protein interaction networks have many undiscovered links between entities. Experimental verification of every link in these networks is prohibitively expensive, so computational methods that direct the search for possible links are of great value. The problem of finding undiscovered links in a network is also referred to as the link prediction problem. A popular approach to link prediction is to formulate it as a binary classification problem in which class labels indicate the existence or absence of a link between a pair of nodes in the network (we refer to these as positive links and negative links, respectively). Researchers have successfully applied such supervised classification techniques to determine the presence of links in protein interaction networks. However, it is quite common for protein-protein interaction (PPI) networks to have a large proportion of undiscovered links. Thus, a link prediction approach could incorrectly treat undiscovered positive links as negative links, thereby introducing a bias in the learning. In this paper, we propose to denoise the class of negative links in the training data via a Gaussian process anomaly detector. We show that this significantly reduces the noise due to mislabelled negative links and improves the resulting link prediction accuracy. We evaluate the approach by introducing synthetic noise into the PPI networks and measuring how accurately we can reconstruct the original PPI networks using classifiers trained on both noisy and denoised data. Experiments were performed with five different PPI network datasets, and the results indicate a significant reduction in bias due to label noise and, more importantly, a significant improvement in the accuracy of detecting missing links via classification.
Keywords: Anomaly detection; Link prediction; Protein protein interaction networks
Year: 2017 PMID: 30533510 PMCID: PMC6245231 DOI: 10.1007/s41109-017-0022-7
Source DB: PubMed Journal: Appl Netw Sci ISSN: 2364-8228
Network datasets
| Network | Type | No. of nodes | No. of edges |
|---|---|---|---|
| Arabidopsis thaliana | Undirected | 7550 | 19962 |
| Caenorhabditis elegans | Undirected | 5758 | 14829 |
| Mus musculus | Undirected | 6236 | 13865 |
| Rattus norvegicus | Undirected | 2448 | 3804 |
Fig. 1. Experimental setup overview. (a) Training and testing of anomaly detection methods. (b) Classifier predictions on the unfiltered and filtered datasets
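The denoising step outlined in the abstract and in Fig. 1 can be illustrated with a minimal sketch. It assumes one simple form of Gaussian process one-class scoring: GP regression in which every known link has target 1, with the predictive mean used as a "looks like a link" score. The kernel, the noise level, the 0.5 threshold, the function names, and the 2-D toy feature vectors are all assumptions made for illustration; the configuration actually used by the authors is not given in this section.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between the rows of a and the rows of b."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale**2)


def gp_one_class_scores(known_links, candidates, noise=0.1, length_scale=1.0):
    """GP regression predictive mean when every known link has target 1.

    With a zero-mean prior, scores close to 1 mean "similar to known links"
    and scores close to 0 mean "unlike any known link".
    """
    n = len(known_links)
    k_train = rbf_kernel(known_links, known_links, length_scale)
    k_cross = rbf_kernel(candidates, known_links, length_scale)
    alpha = np.linalg.solve(k_train + noise * np.eye(n), np.ones(n))
    return k_cross @ alpha


def denoise_negatives(known_links, assumed_negatives, threshold=0.5):
    """Drop assumed negatives that score like positives (hypothetical threshold)."""
    scores = gp_one_class_scores(known_links, assumed_negatives)
    keep_mask = scores < threshold
    return assumed_negatives[keep_mask], keep_mask


# Toy feature vectors for node pairs (illustrative values, not from the paper).
known_links = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 1.2], [1.2, 1.0]])
# The third "negative" sits inside the positive cluster: an undiscovered link.
assumed_negatives = np.array([[-1.0, -1.0], [-1.2, -0.8], [1.05, 1.0], [-0.9, -1.1]])

clean_negatives, keep_mask = denoise_negatives(known_links, assumed_negatives)
print(keep_mask)
```

A downstream classifier would then be trained on `known_links` as positives and `clean_negatives` as negatives, instead of on the unfiltered negative set.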
Comparison of anomaly detection techniques under different metrics (all values in %)
| Network | Anomaly method | Accuracy | F-measure | Sensitivity | Specificity | FP rate | FN rate |
|---|---|---|---|---|---|---|---|
| Arabidopsis thaliana | gpoc | 96.69 | 96.77 | 99.19 | 94.19 | 5.81 | 0.81 |
| Arabidopsis thaliana | parzen | 85.75 | 85.12 | 81.50 | 90.00 | 10.00 | 18.50 |
| Arabidopsis thaliana | pca | 69.25 | 76.48 | 100.00 | 38.50 | 61.50 | 0.00 |
| Arabidopsis thaliana | nn | 77.50 | 81.63 | 100.00 | 55.00 | 45.00 | 0.00 |
| Caenorhabditis elegans | gpoc | 90.98 | 91.59 | 98.23 | 83.73 | 16.27 | 1.77 |
| Caenorhabditis elegans | parzen | 69.63 | 76.61 | 99.50 | 39.75 | 60.25 | 0.50 |
| Caenorhabditis elegans | pca | 56.13 | 69.50 | 100.00 | 12.25 | 87.75 | 0.00 |
| Caenorhabditis elegans | nn | 56.37 | 69.63 | 100.00 | 12.75 | 87.25 | 0.00 |
| Mus musculus | gpoc | 94.90 | 95.13 | 99.56 | 90.24 | 9.76 | 0.44 |
| Mus musculus | parzen | 90.62 | 91.39 | 99.50 | 81.75 | 18.25 | 0.50 |
| Mus musculus | pca | 70.37 | 77.15 | 100.00 | 40.75 | 59.25 | 0.00 |
| Mus musculus | nn | 76.50 | 80.97 | 100.00 | 53.00 | 47.00 | 0.00 |
| Rattus norvegicus | gpoc | 98.10 | 98.13 | 99.68 | 96.52 | 3.48 | 0.32 |
| Rattus norvegicus | parzen | 96.13 | 96.25 | 99.50 | 92.75 | 7.25 | 0.50 |
| Rattus norvegicus | pca | 77.25 | 81.47 | 100.00 | 54.50 | 45.50 | 0.00 |
| Rattus norvegicus | nn | 91.37 | 92.06 | 100.00 | 82.75 | 17.25 | 0.00 |
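The six metrics in the table above are related through the confusion matrix, which a short sketch can make concrete. The function name and the counts are hypothetical: the test-set sizes are not stated in this section, and the counts below (400 positives, 400 negatives) were chosen only because they are consistent with the parzen row for Arabidopsis thaliana.

```python
def detection_metrics(tp, fn, tn, fp):
    """Return the six table metrics, as percentages, from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    return {
        "accuracy": 100 * (tp + tn) / (tp + fn + tn + fp),
        "f_measure": 100 * 2 * precision * sensitivity / (precision + sensitivity),
        "sensitivity": 100 * sensitivity,
        "specificity": 100 * specificity,
        "fp_rate": 100 * fp / (fp + tn),   # equals 100 - specificity
        "fn_rate": 100 * fn / (fn + tp),   # equals 100 - sensitivity
    }


# Hypothetical counts: 400 positives and 400 negatives in the test set.
metrics = detection_metrics(tp=326, fn=74, tn=360, fp=40)
rounded = {name: round(value, 2) for name, value in metrics.items()}
print(rounded)
```

With equally sized classes, accuracy is the mean of sensitivity and specificity, which is the pattern every row of the table follows.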
Anomaly detector performance measure
| Network | TPR | TNR |
|---|---|---|
| Arabidopsis thaliana | 99.62 | 93.23 |
| Caenorhabditis elegans | 99.22 | 82.33 |
| Mus musculus | 99.70 | 91.50 |
| Rattus norvegicus | 99.65 | 95.58 |
Classification comparison under different metrics using simple random sampling
| Network | Classifier | Accuracy (without anomaly detection) | F-measure (without anomaly detection) | AUC (without anomaly detection) | Accuracy (with anomaly detection) | F-measure (with anomaly detection) | AUC (with anomaly detection) |
|---|---|---|---|---|---|---|---|
| Arabidopsis thaliana | SVM | 93.07 | 92.80 | 93.07 | 96.85 | 96.94 | 96.85 |
| Arabidopsis thaliana | C5.0* | 99.35 | 99.35 | 99.35 | 99.34 | 99.34 | 99.34 |
| Arabidopsis thaliana | KNN | 94.24 | 93.97 | 94.24 | 98.42 | 98.44 | 98.42 |
| Arabidopsis thaliana | NB | 62.29 | 40.54 | 62.29 | 84.22 | 82.24 | 84.22 |
| Caenorhabditis elegans | SVM | 87.67 | 86.02 | 87.67 | 94.02 | 94.34 | 94.02 |
| Caenorhabditis elegans | C5.0 | 97.81 | 97.78 | 97.81 | 98.30 | 98.32 | 98.30 |
| Caenorhabditis elegans | KNN | 92.99 | 92.73 | 92.99 | 96.07 | 96.20 | 96.07 |
| Caenorhabditis elegans | NB | 59.18 | 36.39 | 59.18 | 67.23 | 57.46 | 67.23 |
| Mus musculus | SVM | 93.30 | 93.03 | 93.31 | 96.83 | 96.93 | 96.84 |
| Mus musculus | C5.0 | 98.06 | 98.04 | 98.06 | 99.32 | 99.33 | 98.32 |
| Mus musculus | KNN | 94.12 | 93.84 | 94.12 | 98.47 | 98.49 | 98.47 |
| Mus musculus | NB | 60.38 | 35.23 | 60.38 | 79.32 | 75.55 | 79.32 |
| Rattus norvegicus | SVM | 91.65 | 90.79 | 91.65 | 98.79 | 98.81 | 98.79 |
| Rattus norvegicus | C5.0 | 91.34 | 90.54 | 91.34 | 99.46 | 99.45 | 99.45 |
| Rattus norvegicus | KNN | 88.15 | 86.57 | 88.15 | 99.35 | 99.35 | 99.35 |
| Rattus norvegicus | NB | 66.29 | 49.32 | 66.29 | 84.25 | 82.00 | 84.25 |
\* indicates a p-value > 0.05
Classification comparison under different metrics using balanced random sampling
| Network | Classifier | Accuracy (without anomaly detection) | F-measure (without anomaly detection) | AUC (without anomaly detection) | Accuracy (with anomaly detection) | F-measure (with anomaly detection) | AUC (with anomaly detection) |
|---|---|---|---|---|---|---|---|
| Arabidopsis thaliana | SVM | 92.99 | 92.70 | 92.99 | 97.01 | 97.09 | 97.01 |
| Arabidopsis thaliana | C5.0* | 99.41 | 99.76 | 99.41 | 99.41 | 99.41 | 99.41 |
| Arabidopsis thaliana | KNN | 94.81 | 94.60 | 94.81 | 98.43 | 98.45 | 98.43 |
| Arabidopsis thaliana | NB | 62.55 | 41.18 | 62.55 | 86.77 | 85.53 | 86.77 |
| Caenorhabditis elegans | SVM | 87.02 | 85.18 | 87.02 | 93.60 | 93.97 | 93.60 |
| Caenorhabditis elegans | C5.0* | 97.77 | 97.74 | 97.77 | 97.95 | 97.97 | 97.95 |
| Caenorhabditis elegans | KNN | 92.63 | 92.33 | 92.63 | 96.20 | 96.32 | 96.20 |
| Caenorhabditis elegans | NB | 60.32 | 41.53 | 60.32 | 66.90 | 56.98 | 66.90 |
| Mus musculus | SVM | 93.71 | 93.49 | 93.71 | 96.64 | 96.75 | 96.64 |
| Mus musculus | C5.0 | 98.63 | 98.62 | 98.63 | 99.37 | 99.37 | 99.37 |
| Mus musculus | KNN | 93.14 | 92.74 | 93.14 | 98.38 | 98.41 | 98.38 |
| Mus musculus | NB | 59.67 | 33.26 | 59.67 | 78.60 | 74.78 | 78.60 |
| Rattus norvegicus | SVM | 94.61 | 94.35 | 94.61 | 98.56 | 98.57 | 98.56 |
| Rattus norvegicus | C5.0 | 90.75 | 89.84 | 90.75 | 99.40 | 99.40 | 99.40 |
| Rattus norvegicus | KNN | 86.20 | 84.04 | 86.20 | 99.37 | 99.37 | 99.37 |
| Rattus norvegicus | NB | 67.42 | 51.85 | 67.42 | 82.73 | 80.13 | 82.73 |
\* indicates a p-value > 0.05
Fig. 2. Accuracy comparison of different classifiers with and without the anomaly detection technique using simple random sampling
InfoGain values of different features for all PPI networks
| Feature | Arabidopsis thaliana | Caenorhabditis elegans | Mus musculus | Rattus norvegicus | Average | Standard deviation |
|---|---|---|---|---|---|---|
| deg(u) | 0.32 | 0.10 | 0.24 | 0.260 | 0.23 | 0.09 |
| deg(v) | 0.24 | 0.093 | 0.17 | 0.17 | 0.17 | 0.06 |
| subgraph-edge-no(u) | 0.23 | 0.07 | 0.19 | 0.20 | 0.17 | 0.07 |
| subgraph-edge-no(v) | 0.18 | 0.07 | 0.13 | 0.14 | 0.13 | 0.05 |
| subgraph-edge-no[u] | 0.32 | 0.11 | 0.23 | 0.25 | 0.23 | 0.09 |
| subgraph-edge-no[v] | 0.24 | 0.09 | 0.16 | 0.17 | 0.17 | 0.06 |
| density-nbhd-subgraph(u) | 0.27 | 0.11 | 0.23 | 0.23 | 0.21 | 0.07 |
| density-nbhd-subgraph(v) | 0.23 | 0.10 | 0.17 | 0.17 | 0.17 | 0.05 |
| density-nbhd-subgraph[u] | 0.28 | 0.11 | 0.23 | 0.23 | 0.21 | 0.07 |
| density-nbhd-subgraph[v] | 0.23 | 0.10 | 0.17 | 0.17 | 0.17 | 0.05 |
| TN(u,v) |  | 0.20 | 0.41 | 0.49 | 0.40 | 0.15 |
| JC(u,v) | 0.08 | 0.03 | 0.037 | 0.12 | 0.06 | 0.04 |
| PA(u,v) |  | 0.23 | 0.42 |  | 0.42 | 0.14 |
| CAR(u,v) | 0.08 | 0.06 | 0.11 | 0.09 | 0.08 | 0.02 |
|  |  | 0.40 |  |  |  |  |
| CAA(u,v) | 0.10 | 0.07 | 0.11 | 0.09 | 0.09 | 0.02 |
| CRA(u,v) | 0.11 | 0.08 | 0.12 | 0.10 | 0.10 | 0.02 |
| CJC(u,v) | 0.08 | 0.06 | 0.10 | 0.09 | 0.08 | 0.02 |
|  |  | 0.40 |  |  |  |  |
| nbhd-subgraph[u,v] |  | 0.19 | 0.36 | 0.44 | 0.37 | 0.14 |
All features with average deviation (or average mean) > 0.5 for all networks are in bold. All values > 0.5 are in bold
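For readers unfamiliar with the InfoGain score reported above, the following minimal sketch computes information gain as the entropy of the class label minus its conditional entropy given a feature. It assumes the feature has already been discretised into bins; the binning scheme, the function names, and the toy data are assumptions for illustration, not the authors' procedure.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(feature_bins, labels):
    """H(labels) - H(labels | feature) for a discrete (already binned) feature."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature_bins):
        subset = [y for x, y in zip(feature_bins, labels) if x == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional


# Toy data: 1 = link, 0 = no link (illustrative, not taken from the paper).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
perfect = ["high"] * 4 + ["low"] * 4            # separates the classes exactly
partial = ["high", "high", "high", "low", "high", "low", "low", "low"]
useless = ["high", "low"] * 4                   # same class mix in both bins

gain_perfect = info_gain(perfect, labels)
gain_partial = info_gain(partial, labels)
gain_useless = info_gain(useless, labels)
print(gain_perfect, gain_partial, gain_useless)
```

For a balanced binary label the score ranges from 0 (the feature says nothing about link existence) to 1 bit (the feature determines it), which gives a scale for reading the values in the table.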
Results for classifiers trained on Arabidopsis thaliana and tested on the remaining datasets
| Classifier | Test network | Accuracy (without anomaly detection) | F-measure (without anomaly detection) | AUC (without anomaly detection) | Accuracy (with anomaly detection) | F-measure (with anomaly detection) | AUC (with anomaly detection) |
|---|---|---|---|---|---|---|---|
| SVM | Rattus norvegicus | 96.71 | 96.64 | 96.71 | 98.98 | 98.99 | 98.98 |
| SVM | Mus musculus | 93.78 | 93.68 | 93.78 | 95.79 | 95.96 | 95.79 |
| SVM | Caenorhabditis elegans | 85.03 | 86.03 | 85.03 | 81.63 | 84.48 | 81.63 |
| C5.0 | Rattus norvegicus | 99.46 | 99.46 | 99.46 | 99.45 | 99.45 | 99.45 |
| C5.0 | Mus musculus | 99.34 | 99.34 | 99.34 | 99.22 | 99.22 | 99.22 |
| C5.0 | Caenorhabditis elegans | 96.92 | 97.00 | 96.92 | 95.72 | 95.89 | 95.72 |
| KNN | Rattus norvegicus | 96.66 | 96.56 | 96.67 | 99.58 | 99.58 | 99.58 |
| KNN | Mus musculus | 95.45 | 95.31 | 95.45 | 98.18 | 98.20 | 98.18 |
| KNN | Caenorhabditis elegans | 91.38 | 91.45 | 91.37 | 90.52 | 91.34 | 90.52 |
| NB | Rattus norvegicus | 64.42 | 44.95 | 64.42 | 77.33 | 71.09 | 77.33 |
| NB | Mus musculus | 60.78 | 36.47 | 60.78 | 76.14 | 70.10 | 76.15 |
| NB | Caenorhabditis elegans | 57.96 | 35.03 | 57.96 | 70.69 | 68.43 | 70.69 |
Results for classifiers trained on Caenorhabditis elegans and tested on the remaining datasets
| Classifier | Test network | Accuracy (without anomaly detection) | F-measure (without anomaly detection) | AUC (without anomaly detection) | Accuracy (with anomaly detection) | F-measure (with anomaly detection) | AUC (with anomaly detection) |
|---|---|---|---|---|---|---|---|
| SVM | Rattus norvegicus | 81.24 | 76.99 | 81.24 | 99.05 | 99.05 | 99.05 |
| SVM | Mus musculus | 82.57 | 78.98 | 82.57 | 98.23 | 98.25 | 98.23 |
| SVM | Arabidopsis thaliana | 81.28 | 77.04 | 81.28 | 97.93 | 97.95 | 97.93 |
| C5.0 | Rattus norvegicus | 95.56 | 95.37 | 95.56 | 97.87 | 97.83 | 97.87 |
| C5.0 | Mus musculus | 95.07 | 94.82 | 95.07 | 98.48 | 98.46 | 98.48 |
| C5.0 | Arabidopsis thaliana | 94.77 | 94.49 | 94.77 | 98.68 | 98.66 | 98.68 |
| KNN | Rattus norvegicus | 93.94 | 93.56 | 93.94 | 99.08 | 99.08 | 99.08 |
| KNN | Mus musculus | 93.45 | 93.02 | 93.45 | 98.53 | 98.52 | 98.53 |
| KNN | Arabidopsis thaliana | 93.29 | 92.87 | 93.29 | 98.29 | 98.29 | 98.29 |
| NB | Rattus norvegicus | 65.38 | 47.24 | 65.38 | 74.37 | 65.78 | 74.37 |
| NB | Mus musculus | 62.56 | 40.95 | 62.56 | 71.66 | 61.22 | 71.66 |
| NB | Arabidopsis thaliana | 62.62 | 40.82 | 62.62 | 72.99 | 64.02 | 72.99 |
Results for classifiers trained on Mus musculus and tested on the remaining datasets
| Classifier | Test network | Accuracy (without anomaly detection) | F-measure (without anomaly detection) | AUC (without anomaly detection) | Accuracy (with anomaly detection) | F-measure (with anomaly detection) | AUC (with anomaly detection) |
|---|---|---|---|---|---|---|---|
| SVM | Rattus norvegicus | 96.93 | 96.86 | 96.93 | 99.02 | 99.02 | 99.02 |
| SVM | Caenorhabditis elegans | 83.96 | 84.87 | 83.96 | 81.52 | 84.40 | 81.52 |
| SVM | Arabidopsis thaliana | 92.83 | 92.67 | 92.83 | 95.66 | 95.84 | 95.66 |
| C5.0 | Rattus norvegicus | 97.93 | 97.89 | 97.93 | 99.68 | 99.68 | 99.68 |
| C5.0 | Caenorhabditis elegans | 94.64 | 94.69 | 94.64 | 95.80 | 95.96 | 95.80 |
| C5.0 | Arabidopsis thaliana | 97.13 | 97.06 | 97.13 | 99.37 | 99.37 | 99.37 |
| KNN | Rattus norvegicus | 94.64 | 94.36 | 94.64 | 99.72 | 99.43 | 99.72 |
| KNN | Caenorhabditis elegans | 91.50 | 91.63 | 91.50 | 91.68 | 92.31 | 91.68 |
| KNN | Arabidopsis thaliana | 93.43 | 93.12 | 93.43 | 98.26 | 98.29 | 98.26 |
| NB | Rattus norvegicus | 65.63 | 47.83 | 65.63 | 79.26 | 74.37 | 79.26 |
| NB | Caenorhabditis elegans | 57.06 | 33.21 | 57.06 | 69.10 | 67.95 | 69.10 |
| NB | Arabidopsis thaliana | 62.28 | 40.61 | 62.28 | 81.72 | 79.44 | 81.72 |
Results for classifiers trained on Rattus norvegicus and tested on the remaining datasets
| Classifier | Test network | Accuracy (without anomaly detection) | F-measure (without anomaly detection) | AUC (without anomaly detection) | Accuracy (with anomaly detection) | F-measure (with anomaly detection) | AUC (with anomaly detection) |
|---|---|---|---|---|---|---|---|
| SVM | Mus musculus | 90.89 | 90.68 | 90.89 | 92.81 | 93.29 | 92.81 |
| SVM | Caenorhabditis elegans | 77.71 | 80.01 | 77.71 | 74.33 | 79.57 | 74.33 |
| SVM | Arabidopsis thaliana | 90.91 | 90.76 | 90.91 | 92.57 | 93.07 | 92.57 |
| C5.0 | Mus musculus | 84.24 | 81.46 | 84.24 | 98.61 | 98.62 | 98.61 |
| C5.0 | Caenorhabditis elegans | 82.43 | 79.73 | 82.43 | 94.31 | 94.59 | 94.31 |
| C5.0 | Arabidopsis thaliana | 82.03 | 78.25 | 82.03 | 98.75 | 98.75 | 98.75 |
| KNN | Mus musculus | 84.38 | 81.85 | 84.38 | 96.63 | 96.72 | 96.63 |
| KNN | Caenorhabditis elegans | 81.45 | 79.86 | 81.45 | 85.88 | 87.57 | 85.88 |
| KNN | Arabidopsis thaliana | 83.11 | 80.19 | 83.11 | 96.57 | 96.66 | 96.57 |
| NB | Mus musculus | 65.93 | 50.25 | 65.93 | 85.73 | 84.82 | 85.73 |
| NB | Caenorhabditis elegans | 60.12 | 48.44 | 60.12 | 71.98 | 74.68 | 71.98 |
| NB | Arabidopsis thaliana | 68.10 | 56.24 | 68.10 | 87.29 | 87.10 | 87.29 |