Xiaomin Liang, Daifeng Li, Min Song, Andrew Madden, Ying Ding, Yi Bu.
Abstract
Advances in machine learning and deep learning methods, together with the increasing availability of large-scale pharmacological, genomic, and chemical datasets, have created opportunities for identifying potentially useful relationships within biochemical networks. Knowledge embedding models have proven valuable for detecting knowledge-based correlations among entities, but little effort has been made to apply them to networks of biochemical entities, because such networks tend to be unbalanced and sparse and knowledge embedding models do not work well on them. To some extent, however, the shortcomings of knowledge embedding models can be compensated for by using them together with graph embedding. In this paper, we combine knowledge embedding and graph embedding to represent biochemical entities and their relations as dense, low-dimensional vectors. We build a cascade learning framework that incorporates semantic features from the knowledge embedding model and graph features from the graph embedding model to score the probability of a link. The proposed method performs noticeably better than the models with which it is compared. It predicted links and entities with an accuracy of 93%, and its average hits@10 score shows an absolute improvement of 8.6% over the original knowledge embedding model and of 1.1% to 9.7% over the other knowledge and graph embedding algorithms. In addition, we designed a meta-path algorithm to detect path relations in the biomedical network. Case studies further verify the value of the proposed model in finding potential relationships between diseases, drugs, genes, treatments, etc. Among the findings of the proposed model is the suggestion that VDR (vitamin D receptor) may be linked to prostate cancer. This is backed by evidence from medical databases and published research, supporting the suggestion that our proposed model could be of value to biomedical researchers.
Year: 2019 PMID: 31194807 PMCID: PMC6565371 DOI: 10.1371/journal.pone.0218264
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Degree distribution of Chem2Bio2Rdf.
Fig 2. The framework of Node2vec.
Developed from Node2vec [27].
The statistics of dataset Chem2Bio2Rdf.
| Relation index | Relation | Relation description | Head entity type | Tail entity type | Number of triplets |
|---|---|---|---|---|---|
| R1 | CHEBI | has chemical ontology | compound | chemical ontology | 14407 |
| R2 | chemogenomic | bind | compound | gene | 515865 |
| R3 | expression | express | compound | gene | 15884 |
| R4 | family_name | has gene family | gene family | gene | 7112 |
| R5 | hprd | protein-protein interaction | gene | gene | 29677 |
| R6 | tissue | expressed in | tissue | gene | 9730 |
| R7 | protein | has pathway | pathway | gene | 10583 |
| R8 | drug | treated by | disease | compound | 909 |
| R9 | gene | caused by | disease | gene | 2646 |
| R10 | cid | induced by | side effect | compound | 8852 |
| R11 | substructure | has substructure | compound | substructure | 6030 |
| R12 | GO_id | has gene ontology | gene | gene ontology | 15884 |
Total triplets: 719865; Total entities: 295911
Fig 3. Correlation between graph embedding and linking.
Fig 4. The framework of knowledge embedding cascade model.
Fig 5. Path prediction for a specific head and tail.
Link prediction results (accuracy).
| Relation | TranSparse | Cascade-s | Cascade-g | Cascade-sg | Cascade model (proposed) |
|---|---|---|---|---|---|
| R1 | 0.9779 | 0.9690 | 0.9757 | 0.9779 | |
| R2 | 0.6066 | 0.9322 | 0.9528 | 0.9541 | |
| R3 | 0.9081 | 0.9205 | 0.8534 | 0.9152 | |
| R4 | 0.8112 | 0.8913 | 0.9058 | 0.9203 | |
| R5 | 0.7892 | 0.8722 | 0.8666 | 0.8703 | |
| R6 | 0.9200 | 0.8943 | 0.9257 | 0.9171 | |
| R7 | 0.9742 | 0.9836 | 0.9225 | 0.9789 | |
| R8 | 0.9722 | 0.9722 | 0.9722 | ||
| R9 | 0.7000 | 0.7500 | 0.7333 | 0.7500 | |
| R10 | 0.9737 | 0.9770 | 0.9836 | 0.9934 | |
| R11 | 0.9845 | 0.9845 | 0.9845 | ||
| R12 | 0.7074 | 0.8353 | 0.8419 | 0.8448 | |
| Avg. | 0.8604 | 0.9208 | 0.9204 | 0.9233 |
Fig 6. ROC curves for a subset of relations on the link prediction task.
Link Prediction based on settings of different graph embedding algorithms (accuracy).
| Relation | Graph Embedding | Cascade-g | Cascade-sg | Cascade model (proposed) | |||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| L | N | L | S | N | L | S | N | L | S | N | |
| R1 | 0.5000 | 0.9668 | 0.7323 | 0.7301 | 0.9757 | 0.9712 | 0.9690 | 0.9779 | 0.9735 | 0.9690 | |
| R2 | 0.7218 | 0.9532 | 0.8466 | 0.6642 | 0.9364 | 0.9344 | 0.9528 | 0.9363 | 0.9337 | 0.9541 | |
| R3 | 0.8834 | 0.7809 | 0.9028 | 0.6590 | 0.8534 | 0.9223 | 0.9205 | 0.9205 | 0.9187 | 0.9205 | |
| R4 | 0.8841 | 0.8841 | 0.8696 | 0.7101 | 0.9058 | 0.8986 | 0.8841 | 0.9203 | 0.8986 | 0.8841 | |
| R5 | 0.9263 | 0.8591 | 0.6950 | 0.8927 | 0.8713 | 0.8722 | 0.8668 | 0.8685 | 0.8731 | 0.8703 | |
| R6 | 0.9000 | 0.7514 | 0.6429 | 0.8943 | 0.9400 | 0.9314 | 0.9257 | 0.9314 | 0.9171 | ||
| R7 | 0.9390 | 0.9085 | 0.9577 | 0.6596 | 0.9225 | 0.9836 | 0.9812 | 0.9789 | |||
| R8 | 0.9167 | 0.9167 | 0.9444 | 0.5833 | 0.9444 | 0.9722 | 0.9444 | ||||
| R9 | 0.9333 | 0.7833 | 0.7333 | 0.8167 | 0.8000 | 0.7500 | 0.7333 | 0.8000 | 0.7667 | 0.8100 | |
| R10 | 0.6546 | 0.9276 | 0.7566 | 0.7171 | 0.9836 | 0.9901 | 0.9770 | 0.9901 | 0.9770 | ||
| R11 | 0.5000 | 0.9330 | 0.8041 | 0.7474 | 0.9845 | 0.9845 | 0.9845 | 0.9845 | 0.9845 | ||
| R12 | 0.5000 | 0.8600 | 0.7423 | 0.6430 | 0.8317 | 0.8369 | 0.8419 | 0.8346 | 0.8320 | 0.8448 | |
| Avg. | 0.7716 | 0.8771 | 0.8650 | 0.6821 | 0.9204 | 0.9228 | 0.9203 | 0.9233 | 0.9231 | 0.9188 | |
L: LINE; S: SDNE; N: Node2vec
TranSparse is used as the knowledge embedding algorithm in the proposed cascade model.
Link Prediction based on settings of different knowledge embedding algorithms (accuracy).
| Relation | Cascade (TransE) | Cascade (TransH) | Cascade (TransR) | Cascade (TranSparse) |
|---|---|---|---|---|
| R1 | 0.9823 | 0.9757 | 0.9801 | |
| R2 | 0.9593 | 0.954 | 0.9541 | |
| R3 | 0.9276 | 0.9081 | 0.9205 | |
| R4 | 0.9203 | 0.9275 | ||
| R5 | 0.8573 | 0.8629 | 0.8703 | |
| R6 | 0.9171 | 0.9000 | 0.9171 | |
| R7 | 0.9812 | 0.9765 | 0.9789 | |
| R8 | 0.9444 | 0.9167 | ||
| R9 | 0.8333 | 0.8100 | ||
| R10 | 0.9934 | 0.9901 | 0.9868 | |
| R11 | 0.9845 | 0.9794 | ||
| R12 | 0.9024 | 0.8488 | 0.8448 | |
| Avg. | 0.9322 | 0.9280 | 0.9299 |
Node2vec is used as the graph embedding algorithm in the proposed cascade model.
Fig 7. Sensitivity of the proposed cascade model to the parameters of Node2vec.
Entity prediction results.
| Models | Mean Rank | hits@10(Avg.) |
|---|---|---|
| Node2Vec | 4999 | 0.0377 |
| TransE | 10120 | 0.1357 |
| TransH | 3784 | 0.1571 |
| TransR | 4935 | 0.2176 |
| TranSparse | 3772 | 0.2398 |
| Cascade-s | 3678 | 0.2285 |
| Cascade-g | 3751 | 0.0700 |
| Cascade-sg | 2174 | 0.2988 |
| Cascade model (proposed) |
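
Both metrics in the table above can be derived from the rank that a model assigns to the correct entity for each test triplet. The following is a minimal sketch, not the authors' evaluation code: the function names, the toy ranks, and the convention that a lower score means a more plausible candidate (as in translation-based embedding models) are assumptions made for illustration.

```python
from typing import Mapping, Sequence


def rank_of_correct(scores: Mapping[str, float], correct: str) -> int:
    """Return the 1-based rank of `correct` among all candidate entities.

    Assumption: a lower score means a more plausible candidate.
    Ties are resolved optimistically (equal scores do not worsen the rank).
    """
    target = scores[correct]
    return 1 + sum(1 for s in scores.values() if s < target)


def mean_rank(ranks: Sequence[int]) -> float:
    """Average rank of the correct entity over all test triplets."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(ranks) / len(ranks)


def hits_at_k(ranks: Sequence[int], k: int = 10) -> float:
    """Fraction of test triplets whose correct entity is ranked in the top k."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks for five test triplets (not taken from the paper).
toy_ranks = [1, 3, 12, 250, 7]
print(mean_rank(toy_ranks))  # 54.6
print(hits_at_k(toy_ranks))  # 0.6
```

In the toy example a single badly ranked triplet (rank 250) dominates the mean rank while barely affecting hits@10, which is one reason the two metrics are usually read together.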
Entity prediction based on settings of different graph embedding algorithms (hits@10).
| Relation | (Predict head + Predict tail) / 2 | ||
|---|---|---|---|
| Cascade-LINE | Cascade-SDNE | Cascade-Node2vec | |
| R1 | 0.2677 | 0.2699 | |
| R2 | 0.4798 | 0.4907 | |
| R3 | 0.1414 | 0.1414 | |
| R4 | 0.3841 | 0.4203 | |
| R5 | 0.1941 | 0.1885 | |
| R6 | 0.1572 | 0.1629 | |
| R7 | 0.2888 | 0.2817 | |
| R8 | 0.2500 | 0.3056 | |
| R9 | 0.1833 | 0.1667 | |
| R10 | 0.0625 | 0.0592 | |
| R11 | 0.1289 | 0.1289 | |
| R12 | 0.2022 | 0.1986 | |
| Avg. | 0.2283 | 0.2345 | |
TranSparse is used as the knowledge embedding algorithm in the proposed cascade model.
Entity prediction based on settings of different knowledge embedding algorithms (hits@10).
| Relation | (Predict head + Predict tail) / 2 | |||
|---|---|---|---|---|
| Cascade (TransE) | Cascade (TransH) | Cascade (TransR) | Cascade (TranSparse) | |
| R1 | 0.4248 | 0.4647 | 0.4115 | |
| R2 | 0.5449 | 0.5541 | 0.5395 | |
| R3 | 0.1855 | 0.2226 | 0.1396 | 0.2138 |
| R4 | 0.4203 | 0.4565 | 0.3551 | |
| R5 | 0.1801 | 0.1903 | 0.1829 | |
| R6 | 0.1743 | 0.18 | 0.1943 | |
| R7 | 0.4179 | 0.4062 | 0.4413 | |
| R8 | 0.3611 | 0.3889 | 0.3334 | |
| R9 | 0.2334 | 0.25 | 0.2334 | |
| R10 | 0.0724 | 0.0724 | 0.0823 | |
| R11 | 0.2629 | 0.299 | 0.2629 | |
| R12 | 0.2203 | 0.236 | 0.1976 | |
| Avg. | 0.2915 | 0.3148 | 0.2807 | |
Node2vec is used as the graph embedding algorithm in the proposed cascade model.
Hits@10 rate for each relation on the biochemical data set.
In the n-n column, the first n is the average number of head entities in the data set for a given pair (r, t), and the second n is the average number of tail entities for a given pair (h, r).
| Index | Relation | n-n | Predict head | Predict tail | ||
|---|---|---|---|---|---|---|
| TranSparse | Cascade model | TranSparse | Cascade model | |||
| R1 | Has chemical ontology | 5.2-16 | 0.208 | 0.3319 | ||
| R2 | Bind | 109.4-2 | 0.1222 | 0.856 | ||
| R3 | Express | 3.7-10.4 | 0.1413 | 0.1307 | ||
| R4 | Has gene family | 1-21.6 | 0.6957 | 0.1159 | ||
| R5 | Protein-Protein interaction | 4.4-4.1 | 0.1642 | 0.2071 | ||
| R6 | Expressed in | 2.5-19.2 | 0.2914 | 0.0457 | ||
| R7 | Has participants | 2.8-55.1 | 0.4977 | 0.1033 | ||
| R8 | Treated by | 1.6-4.8 | 0.5 | 0.2222 | ||
| R9 | Caused by | 1.5-2.1 | 0.3333 | 0 | ||
| R10 | Induced by | 11.2-8.4 | 0.0592 | 0.0789 | ||
| R11 | Has substructure | 20.8-4.6 | 0.0309 | 0.2165 | ||
| R12 | Has gene ontology | 9.1-6.1 | 0.14 | 0.2630 | ||
| Avg. | - | - | 0.2653 | 0.2143 | ||
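
The n-n column above can be reproduced from a list of triplets by counting distinct heads per (r, t) pair and distinct tails per (h, r) pair. The sketch below is illustrative only: the function name, the toy triplets, and the reading of the column as "heads per tail, then tails per head" are assumptions, not code or data from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Triplet = Tuple[str, str, str]  # (head, relation, tail)


def _average(values: List[int]) -> float:
    return sum(values) / len(values)


def relation_cardinality(triplets: Iterable[Triplet]) -> Dict[str, Tuple[float, float]]:
    """Map each relation to (avg heads per (r, t), avg tails per (h, r))."""
    heads_of = defaultdict(set)  # (r, t) -> distinct heads
    tails_of = defaultdict(set)  # (h, r) -> distinct tails
    for h, r, t in triplets:
        heads_of[(r, t)].add(h)
        tails_of[(h, r)].add(t)

    head_counts = defaultdict(list)  # relation -> heads-per-tail counts
    tail_counts = defaultdict(list)  # relation -> tails-per-head counts
    for (r, _t), hs in heads_of.items():
        head_counts[r].append(len(hs))
    for (_h, r), ts in tails_of.items():
        tail_counts[r].append(len(ts))

    return {r: (_average(head_counts[r]), _average(tail_counts[r]))
            for r in head_counts}


# Hypothetical triplets: three compounds bind gene g1, one of them also binds g2.
toy = [("c1", "bind", "g1"), ("c2", "bind", "g1"),
       ("c3", "bind", "g1"), ("c1", "bind", "g2")]
heads_per_tail, tails_per_head = relation_cardinality(toy)["bind"]
print(f"{heads_per_tail:.1f}-{tails_per_head:.1f}")  # 2.0-1.3
```

A strongly asymmetric pair such as 109.4-2 for the bind relation means that, given a gene, many compounds are valid heads, whereas a compound has only a few valid tails on average.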
Top 30 drug-disease-gene paths.
The relations treat, caused by, and bind correspond to the drug-disease, disease-gene, and drug-gene pairs, respectively. The value x/y indicates whether the relation is already known: x = 1 indicates that the relation is present in the training or test sets; y = 1 indicates that it is present in external databases such as DisGeNET and DrugBank.
| Order number | Compound | Disease | Gene | Compound-Disease | Disease-Gene | Compound-Gene |
|---|---|---|---|---|---|---|
| 1 | Delavirdine | HIV | GAG-POL | 1/1 | 0/1 | 1/0 |
| 2 | Atazanavir | HIV | GAG-POL | 1/1 | 0/1 | 0/0 |
| 3 | Zidovudine | HIV | GAG-POL | 1/1 | 0/1 | 1/0 |
| 4 | Tenofovir disoproxil | HIV | GAG-POL | 1/1 | 0/1 | 1/0 |
| 5 | Zalcitabine | HIV | GAG-POL | 1/1 | 0/1 | 1/0 |
| 6 | Didanosine | HIV | GAG-POL | 1/1 | 0/1 | 1/0 |
| 7 | Emtricitabine | HIV | GAG-POL | 1/1 | 0/1 | 1/0 |
| 8 | Zidovudine | HIV | DEOA | 1/1 | 0/0 | 0/0 |
| 9 | Mercaptopurine | Leukemia | VDR | 1/1 | 0/0 | 1/0 |
| 10 | Efavirenz | HIV | GAG-POL | 1/1 | 0/1 | 1/0 |
| 11 | Zalcitabine | HIV | DEOA | 1/1 | 0/0 | 0/0 |
| 12 | Lamivudine | HIV | GAG-POL | 1/1 | 0/1 | 1/0 |
| 13 | Nevirapine | HIV | GAG-POL | 1/1 | 0/1 | 1/0 |
| 14 | Didanosine | HIV | DEOA | 1/1 | 0/0 | 0/0 |
| 15 | Emtricitabine | HIV | DEOA | 1/1 | 0/0 | 0/0 |
| 16 | Calcidiol | Hypoparathyroidism | VDR | 1/1 | 0/0 | 1/1 |
| 17 | Cidofovir | Immunodeficiency | UL54 | 1/1 | 0/0 | 1/0 |
| 18 | Calcitriol | Hypocalcemia | CASR | 1/1 | 1/0 | 0/0 |
| 19 | Calcidiol | Hypocalcemia | CASR | 1/1 | 1/0 | 0/0 |
| 20 | Ganciclovir | Immunodeficiency | UL54 | 1/1 | 0/0 | 0/0 |
| 21 | Daunorubicin | Leukemia | VDR | 1/1 | 0/0 | 1/0 |
| 22 | Stavudine | HIV | GAG-POL | 1/1 | 0/1 | 1/0 |
| 23 | Calcidiol | Hypoparathyroidism | CYP24A1 | 1/1 | 0/0 | 0/0 |
| 24 | Vincristine | Leukemia | VDR | 0/1 | 0/0 | 1/0 |
| 25 | Calcitriol | Osteoporosis | FGF23 | 1/1 | 0/0 | 0/0 |
| 26 | Risedronate | Osteoporosis | FDPS | 1/1 | 0/0 | 1/0 |
| 27 | Foscarnet | HIV | GAG-POL | 0/1 | 0/1 | 0/0 |
| 28 | Fluorouracil | Immunodeficiency | UNG | 0/1 | 1/0 | 0/0 |
| 29 | Propylthiouracil | HIV | DEOA | 0/1 | 0/0 | 0/0 |
| 30 | Lamivudine | HIV | DEOA | 1/1 | 0/0 | 0/0 |
Matching results of drug-disease-gene top 100 paths with database.
“Triplets” is the number of triplets of the given relation that appear in the top 100 paths. “Predictions” is the number of relations found neither in the data sets nor in the chosen databases. “Proven predictions” is the number of relations that are not in the data sets but are matched in the chosen databases.
| Relations | Triplets | Predictions | Proven predictions |
|---|---|---|---|
| Treat(compound-disease) | 76 | 26 | 26 |
| Caused by(disease-gene) | 37 | 30 | 1 |
| Bind(compound-target) | 90 | 49 | 0 |
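
The three categories used in the two path tables (already in the data set, predicted and confirmed by a database, predicted and unconfirmed) follow directly from the x/y flags. The sketch below shows one way to tally them; it is an assumption-laden illustration rather than the authors' procedure. The function names are hypothetical, duplicate pairs are assumed to be counted once, and the toy sets only mirror the flags of rows 1, 24, and 2 of the top-30 table.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]  # e.g. (compound, disease) or (compound, gene)


def classify(pair: Pair, dataset: Set[Pair], database: Set[Pair]) -> str:
    """Label a predicted relation according to its x/y flags."""
    if pair in dataset:  # x = 1: already in the training or test sets
        return "known"
    # x = 0: a new prediction; y decides whether a database confirms it
    return "proven prediction" if pair in database else "prediction"


def summarize(pairs: Iterable[Pair], dataset: Set[Pair], database: Set[Pair]) -> Counter:
    """Count each category once per distinct pair (assumed de-duplication)."""
    return Counter(classify(p, dataset, database) for p in set(pairs))


# Toy sets mirroring three x/y flags from the top-30 table.
dataset = {("Delavirdine", "HIV")}                                # x = 1
database = {("Delavirdine", "HIV"), ("Vincristine", "Leukemia")}  # y = 1
pairs = [
    ("Delavirdine", "HIV"),       # 1/1 -> known
    ("Vincristine", "Leukemia"),  # 0/1 -> proven prediction
    ("Atazanavir", "GAG-POL"),    # 0/0 -> prediction
    ("Delavirdine", "HIV"),       # repeated pair, counted once
]
print(summarize(pairs, dataset, database))
```

Counting distinct pairs rather than paths matters here because the same compound-disease pair (for example an HIV drug) can appear in several of the top-ranked paths.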