Gamal Crichton, Yufan Guo, Sampo Pyysalo, Anna Korhonen.
Abstract
BACKGROUND: Link prediction in biomedical graphs has several important applications, including predicting Drug-Target Interactions (DTI), Protein-Protein Interactions (PPI) and Literature-Based Discovery (LBD). It can be done using a classifier that outputs the probability of link formation between nodes. Recently, several works have used neural networks to create node representations, which allow rich inputs to neural classifiers. Preliminary works on this approach report promising results. However, they did not use realistic settings such as time-slicing, evaluate performance with comprehensive metrics, or explain when or why neural network methods outperform. We investigated how inputs from four node representation algorithms affect the performance of a neural link predictor on random- and time-sliced biomedical graphs of real-world sizes (∼6 million edges) containing information relevant to DTI, PPI and LBD. We compared the performance of the neural link predictor to that of established baselines and report performance across five metrics.
Keywords: Data mining; Drug-target interaction; Link prediction; Literature-based discovery; Neural networks
Year: 2018 PMID: 29783926 PMCID: PMC5963080 DOI: 10.1186/s12859-018-2163-9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Visualisation of ‘Viral Pneumonia’ and ‘Hydrochloric Acid’ from the PubTator dataset. Nodes representing respiratory infections are close to the former, while those of acids and other chemicals are close to the latter
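Node combination methods. Binary operators are element-wise; $u_i$ and $v_i$ denote the $i$-th components of the two $d$-dimensional node vectors
| Operator | Definition |
|---|---|
| Average | $(u_i + v_i) / 2$ |
| Concatenate | $(u_1, \ldots, u_d, v_1, \ldots, v_d)$ |
| Hadamard | $u_i \cdot v_i$ |
| Weighted-L1 | $\lvert u_i - v_i \rvert$ |
| Weighted-L2 | $\lvert u_i - v_i \rvert^2$ |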
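
The node combination operators named above can be sketched in a few lines of plain Python. This is a minimal illustration that assumes the commonly used definitions of these operators (element-wise mean, product, absolute difference and squared difference, plus concatenation); the function names and the toy vectors are hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, List

Vector = List[float]


def _check_same_length(u: Vector, v: Vector) -> None:
    # Element-wise operators only make sense for vectors of equal dimension.
    if len(u) != len(v):
        raise ValueError("node vectors must have the same dimension")


def average(u: Vector, v: Vector) -> Vector:
    _check_same_length(u, v)
    return [(a + b) / 2 for a, b in zip(u, v)]


def concatenate(u: Vector, v: Vector) -> Vector:
    # Not element-wise: the result has twice the dimension of the inputs.
    return list(u) + list(v)


def hadamard(u: Vector, v: Vector) -> Vector:
    _check_same_length(u, v)
    return [a * b for a, b in zip(u, v)]


def weighted_l1(u: Vector, v: Vector) -> Vector:
    _check_same_length(u, v)
    return [abs(a - b) for a, b in zip(u, v)]


def weighted_l2(u: Vector, v: Vector) -> Vector:
    _check_same_length(u, v)
    return [(a - b) ** 2 for a, b in zip(u, v)]


# Hypothetical registry so a combiner can be selected by name.
COMBINERS: Dict[str, Callable[[Vector, Vector], Vector]] = {
    "average": average,
    "concatenate": concatenate,
    "hadamard": hadamard,
    "weighted_l1": weighted_l1,
    "weighted_l2": weighted_l2,
}

# Toy 3-dimensional node vectors (illustrative values only).
u = [1.0, 2.0, 3.0]
v = [3.0, 0.0, -1.0]
pair_features = {name: fn(u, v) for name, fn in COMBINERS.items()}
```

The combined vector is what a link classifier would receive as the representation of the node pair; note that concatenation preserves the order of the pair, while the other four operators are symmetric in `u` and `v`.
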
The datasets and their relevant details (undirected link count)
| Dataset | Node count | Link count | Has dates | Link type |
|---|---|---|---|---|
| BioGRID | 65,026 | 1,076,308 | Yes | Published interactions |
| MATADOR | 3,704 | 15,843 | No | Drug-target interactions |
| PubTator | 265,148 | 6,854,054 | Yes | Literature co-occurrences |
Baseline methods for node pair (u, v) with neighbour sets N(u) and N(v). (x) are the neighbours of the neighbours of x
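| Name | Definition | Bipartite definition |
|---|---|---|
| Adamic-Adar | $\sum_{z \in N(u) \cap N(v)} 1 / \log \lvert N(z) \rvert$ | |
| Common Neighbours | $\lvert N(u) \cap N(v) \rvert$ | |
| Jaccard Index | $\lvert N(u) \cap N(v) \rvert / \lvert N(u) \cup N(v) \rvert$ | |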
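
As a rough illustration of the three baseline scores for a node pair (u, v), the sketch below uses their standard, non-bipartite definitions over neighbour sets N(u) and N(v). The adjacency-dictionary graph representation, the function names and the toy edge list are assumptions made for this example; the bipartite variants based on neighbours of neighbours are not covered.

```python
import math
from typing import Dict, Hashable, Iterable, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]


def build_graph(edges: Iterable[Tuple[Hashable, Hashable]]) -> Graph:
    """Build an undirected graph as a mapping from node to neighbour set."""
    graph: Graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def common_neighbours(graph: Graph, u: Hashable, v: Hashable) -> int:
    # Number of nodes adjacent to both u and v.
    return len(graph[u] & graph[v])


def jaccard_index(graph: Graph, u: Hashable, v: Hashable) -> float:
    # Shared neighbours divided by all neighbours of either node.
    union = graph[u] | graph[v]
    return len(graph[u] & graph[v]) / len(union) if union else 0.0


def adamic_adar(graph: Graph, u: Hashable, v: Hashable) -> float:
    # Shared neighbours weighted by inverse log-degree, so that
    # low-degree shared neighbours count for more.
    return sum(
        1.0 / math.log(len(graph[z]))
        for z in graph[u] & graph[v]
        if len(graph[z]) > 1  # guard against log(1) = 0
    )


# Toy graph (illustrative only).
toy_graph = build_graph(
    [("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "e")]
)
cn_ab = common_neighbours(toy_graph, "a", "b")
ji_ab = jaccard_index(toy_graph, "a", "b")
aa_ab = adamic_adar(toy_graph, "a", "b")
```

In the toy graph, nodes `a` and `b` share both of their neighbours, so they receive the maximum Jaccard score; a candidate link would be ranked by whichever score is chosen.
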
Time-sliced details (Note: Induction includes Train)
| Dataset | Link use | Time slice | Link count | Link percentage (%) |
|---|---|---|---|---|
| BioGRID | Induction | 1970-2014 | 678,994 | 63.08 |
| BioGRID | Train | 2013-2014 | 121,442 | 11.28 |
| BioGRID | Test | 2015-2017 | 397,302 | 36.91 |
| PubTator | Induction | 1873-2003 | 4,069,683 | 59.38 |
| PubTator | Train | 2001-2003 | 614,031 | 5.90 |
| PubTator | Test | 2004-2017 | 2,784,371 | 40.62 |
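
A time-sliced split of this kind can be sketched as follows, assuming each link carries the year it was first recorded. The `DatedLink` type, the `time_slice` function and the toy links are hypothetical; the year boundaries in the usage lines mirror the BioGRID rows above. How links that recur across periods were handled is not stated here, so the sketch simply filters by year.

```python
from typing import List, NamedTuple, Tuple


class DatedLink(NamedTuple):
    u: str
    v: str
    year: int


def time_slice(
    links: List[DatedLink],
    train_start: int,
    induction_end: int,
    test_start: int,
) -> Tuple[List[DatedLink], List[DatedLink], List[DatedLink]]:
    """Split dated links into induction, train and test sets.

    Induction covers everything up to induction_end; train is the most
    recent part of the induction period (so induction includes train);
    test covers everything from test_start onwards.
    """
    induction = [link for link in links if link.year <= induction_end]
    train = [link for link in induction if link.year >= train_start]
    test = [link for link in links if link.year >= test_start]
    return induction, train, test


# Toy links (illustrative only), split with the BioGRID year boundaries.
toy_links = [
    DatedLink("p1", "p2", 1999),
    DatedLink("p2", "p3", 2013),
    DatedLink("p3", "p4", 2014),
    DatedLink("p1", "p4", 2016),
]
induction, train, test = time_slice(
    toy_links, train_start=2013, induction_end=2014, test_start=2015
)
```

Because the test period lies strictly after the induction period, a predictor evaluated this way only sees links that existed before the ones it is asked to predict, which is what distinguishes this setting from a random split.
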
MATADOR random-slice results
| Method | Node combination | AUC (ROC) | AUC (PR) | MAP | Avg. R-prec | Prec@k |
|---|---|---|---|---|---|---|
| DeepWalk | Average | 95.93 | 95.82 | 89.81 | 86.86 | 98.77* |
| DeepWalk | Concat | 94.97 | 94.83 | 88.30 | 84.63 | 98.34* |
| LINE | Average | 80.63 | 81.30 | 67.74 | 61.04 | 91.65 |
| LINE | Concat | 81.16 | 81.82 | 68.53 | 61.42 | 92.00 |
| node2vec | Average | 78.38 | 78.75 | 66.42 | 59.32 | 88.67 |
| node2vec | Concat | 77.62 | 77.54 | 65.44 | 58.40 | 87.25 |
| AA | N/A | 91.97 | 88.40 | 87.16 | 85.06 | 86.87 |
| CN | N/A | | 97.04* | | | 98.74* |
| JI | N/A | 97.23* | | 94.72 | 92.29 | |
(Bold: best score, *: not statistically different from best)
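
The ranking-based columns in these results tables (MAP, averaged R-precision and precision at a cutoff) can be illustrated with the standard textbook definitions below. This is a sketch under assumptions: the function names, the cutoff `k` and the toy scores are hypothetical, and how the scores were averaged across queries in the reported results is not specified here.

```python
from typing import List, Sequence


def rank_labels(scores: Sequence[float], labels: Sequence[int]) -> List[int]:
    """Return the 0/1 labels ordered by descending predicted score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def precision_at_k(ranked: Sequence[int], k: int) -> float:
    # Fraction of true links among the top-k ranked candidates.
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(ranked[:k]) / k


def r_precision(ranked: Sequence[int]) -> float:
    # Precision at cutoff R, where R is the number of true links.
    r = sum(ranked)
    return sum(ranked[:r]) / r if r else 0.0


def average_precision(ranked: Sequence[int]) -> float:
    # Mean of the precision values at each rank where a true link occurs.
    hits = 0
    total = 0.0
    for rank, is_link in enumerate(ranked, start=1):
        if is_link:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(rankings: Sequence[Sequence[int]]) -> float:
    # MAP: average precision averaged over several ranked lists.
    if not rankings:
        return 0.0
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Toy example: five candidate links, three of which are true links.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_labels = [1, 0, 1, 1, 0]
toy_ranked = rank_labels(toy_scores, toy_labels)
p_at_3 = precision_at_k(toy_ranked, 3)
r_prec = r_precision(toy_ranked)
avg_prec = average_precision(toy_ranked)
```

Unlike the two AUC measures, these metrics emphasise the top of the ranking, which is why a method can score well on one family and poorly on the other.
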
BioGRID random-slice and time-slice results
| Method | Node combination | Random: AUC (ROC) | Random: AUC (PR) | Random: MAP | Random: Avg. R-prec | Random: Prec@k | Time: AUC (ROC) | Time: AUC (PR) | Time: MAP | Time: Avg. R-prec | Time: Prec@k |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepWalk | Average | 97.69 | 97.62 | 79.24 | 73.86 | 99.30 | 89.40 | 90.10 | 68.94 | 63.30 | 97.25* |
| DeepWalk | Concat | 97.74 | 97.65 | 82.48 | 77.70 | 99.18 | 92.12 | 92.78 | 71.61 | 65.96 | 98.04 |
| LINE | Average | 98.10* | 97.80* | 83.13* | 78.22* | 99.54* | 91.86 | 92.31 | 72.85 | 67.76 | 97.40 |
| LINE | Concat | 98.08 | 97.76 | 82.94 | 78.04 | 99.29 | 93.55 | 93.74 | 73.60 | 68.57 | 97.90 |
| node2vec | Average | 98.32* | 97.97* | 85.70* | 81.17* | 99.38* | | | 74.91 | | 98.26 |
| node2vec | Concat | | | | | 99.49* | 93.66 | 94.66* | 73.48 | 68.77 | 98.40* |
| AA | N/A | 86.10 | 90.75 | 70.97 | 57.65 | 96.13 | 77.46 | 87.69 | 74.84 | 61.39 | 98.10 |
| CN | N/A | 91.20 | 94.96 | 75.72 | 69.81 | | 85.07 | 91.81 | | 67.73 | |
| JI | N/A | 90.80 | 93.95 | 73.93 | 68.79 | 98.59 | 84.74 | 90.20 | 75.60 | 67.49 | 97.45 |
(Bold: best score, *: not statistically different from best)
PubTator random-slice and time-slice results
| Method | Node combination | Random: AUC (ROC) | Random: AUC (PR) | Random: MAP | Random: Avg. R-prec | Random: Prec@k | Time: AUC (ROC) | Time: AUC (PR) | Time: MAP | Time: Avg. R-prec | Time: Prec@k |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepWalk | Average | 98.85 | 99.01 | 83.67 | 75.97 | 99.93* | 93.86* | 95.51* | 70.78* | 62.16* | |
| DeepWalk | Concat | | | | 85.46 | 99.94* | | | | | |
| LINE | Average | 99.10* | 99.23* | 90.36* | 84.56 | | 88.68* | 92.27* | 55.61* | 46.41* | |
| LINE | Concat | 99.13 | 99.24 | 90.07 | 84.03 | 99.95* | 90.32 | 93.01 | 62.51 | 53.21 | |
| node2vec | Average | 98.71 | 98.90 | 82.98 | 75.29 | 99.94* | 88.40 | 92.07 | 55.72 | 46.48 | 99.87 |
| node2vec | Concat | 99.16 | 99.21 | 88.94 | 82.14 | 99.92* | 88.13 | 91.83 | 53.24 | 43.69 | 99.84 |
| AA | N/A | 92.92 | 84.56 | 56.48 | 66.38 | 83.33 | 85.10 | 80.24 | 35.49 | 40.13 | 90.56 |
| CN | N/A | 98.40 | 98.28 | 79.84 | | 99.94* | 88.37 | 88.83 | 43.67 | 46.59 | 99.84 |
| JI | N/A | 92.36 | 87.59 | 65.44 | 59.74 | 91.21 | 86.08 | 83.52 | 38.66 | 38.75 | 94.27 |
(Bold: best score, *: not statistically different from best)