Ozge Gurbuz1, Gregorio Alanis-Lobato2, Sergio Picart-Armada2, Miao Sun2, Christian Haslinger2, Nathan Lawless2, Francesc Fernandez-Albert2.
Abstract
Indication expansion aims to find new indications for existing targets in order to accelerate the process of bringing a new drug for a disease to market. The rapid increase in data types and data sources for computational drug discovery has fostered the use of semantic knowledge graphs (KGs) for indication expansion through target-centric approaches, in other words, target repositioning. Previously, we developed a novel method to construct a KG for indication expansion studies, with the aim of finding and justifying alternative indications for a target gene of interest. In contrast to other KGs, ours combines human-curated full-text literature and gene expression data from biomedical databases to encode relationships between genes, diseases, and tissues. Here, we assessed the suitability of our KG for explainable target-disease link prediction using a glass-box approach. To evaluate the predictive power of our KG, we applied shortest-path prediction methods that use tissue information, as well as embedding-based prediction methods, to a graph constructed with information published before or during 2010. We also obtained random baselines by applying the shortest-path predictive methods to KGs with randomly shuffled node labels. Then, we evaluated the accuracy of the top predictions using gene-disease links reported after 2010. In addition, we investigated the contribution of the KG's tissue expression entity to the prediction performance. Our experiments showed that shortest-path-based methods significantly outperform the random baselines and that embedding-based methods outperform the shortest-path predictions. Importantly, removing the tissue expression entity from the KG severely impacts the quality of the predictions, especially those produced by the embedding approaches. Finally, since the interpretability of the predictions is crucial in indication expansion, we highlight the advantages of our glass-box model through the examination of example candidate target-disease predictions.
Keywords: drug discovery; knowledge graphs; ontologies; target repositioning; target repurposing
Year: 2022 PMID: 35360842 PMCID: PMC8963915 DOI: 10.3389/fgene.2022.814093
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Overview of knowledge graph usage in drug discovery.
| Study | Purpose | Method | Data source |
|---|---|---|---|
|  | Drug-Drug Interaction: evaluating the different embedding methods in various Cross Validation schemes | Embedding: RDF2Vec, CBOW, Skip Gram, TransE, TransD; ML Model: Logistic Regression, Naive Bayes, Random Forest | Bio2RDF |
|  | Drug-target interactions | Metapath + Random forest, SVM | Biological and chemical datasets |
|  | Drug target genes for Alzheimer’s Disease | Inference + enrichment analysis | TTD, DrugBank, PharmGKB, AlzGene |
|  | Drug centric KG | Positive and Unlabeled Learning * SVM, Decision Tree and Random Forest | PharmGKB, TTD, KEGG DRUG, DrugBank, SIDER and DID |
|  | Potential drugs for diseases | Logistic regression | Pubmed Abstracts |
|  | Potential drugs for diseases | TransE embedding + LSTM | Pubmed Abstracts |
|  | FDA approved drugs for rare diseases | Network proximity | Pubmed Abstracts |
|  | Predicting clinical failure | Tensor factorization + gene prioritization | 20% is from biomedical literature and biological data sources |
|  | Predicting Gene-Disease links | Embeddings + Random Forest | Gene-Disease links from Disgenet |
|  | Knowledge Graph construction to support drug discovery like predicting Gene-Disease links and | Embeddings (RESCAL) + XGBoost | Gene and Disease nodes and edges from public databases |
| KG for IE | Target repurposing | Tissue based semantic inferencing + Embeddings & Random Forest | Human curated full text literature + biological database |
FIGURE 1. Knowledge graph schema and gene-disease prediction strategies. (A) Upper layer ontology with the entities and relations defining the structure and content of our knowledge graph. (B) Hop-based prediction strategies to find novel gene-disease associations via intermediary genes expressed in the same tissue at the RNA or protein levels. (C) Embedding-based prediction strategy to find novel gene-disease associations via distances/similarities in a latent space.
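The hop-based strategy in panel (B) can be illustrated with a minimal sketch. It assumes the graph is available as plain Python collections; the function name, the argument names, and the treatment of gene-gene links as undirected are illustrative assumptions, not the authors' implementation, whose exact inference rules are not given in this excerpt.

```python
from collections import defaultdict

def one_hop_predictions(gene_gene, gene_disease, gene_tissue=None):
    """Propose (gene, disease) links via one intermediary gene.

    gene_gene:    iterable of (gene_a, gene_b) pairs, treated as undirected (assumption).
    gene_disease: iterable of known (gene, disease) pairs.
    gene_tissue:  optional dict mapping gene -> set of tissues; when given,
                  both genes must be expressed in at least one common tissue.
    """
    neighbours = defaultdict(set)
    for a, b in gene_gene:
        neighbours[a].add(b)
        neighbours[b].add(a)

    known = set(gene_disease)
    predicted = set()
    for gene, disease in known:
        for other in neighbours[gene]:
            if (other, disease) in known:
                continue  # already annotated, so not a novel link
            if gene_tissue is not None:
                shared = gene_tissue.get(gene, set()) & gene_tissue.get(other, set())
                if not shared:
                    continue  # no common tissue: drop the candidate
            predicted.add((other, disease))
    return predicted

# Toy data (hypothetical identifiers).
toy_gene_gene = [("G1", "G2"), ("G1", "G3")]
toy_gene_disease = [("G1", "D1")]
toy_gene_tissue = {"G1": {"liver"}, "G2": {"liver", "lung"}, "G3": {"brain"}}

with_tissue = one_hop_predictions(toy_gene_gene, toy_gene_disease, toy_gene_tissue)
without_tissue = one_hop_predictions(toy_gene_gene, toy_gene_disease)
```

In this toy example the tissue filter keeps only the candidate whose intermediary gene shares a tissue with it, while the unfiltered variant returns both candidates.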
Validation scheme based on the date when the interaction was first reported.
| Node1 | Node2 | Interaction | First referenced | Graph | Type |
|---|---|---|---|---|---|
| Gi | Dk | hasDisease | ≤2010 | KG_Before2010 | Train data |
| Gi | Gj | activates | ≤2010 | KG_Before2010 | Train data |
| Gj | Dk | hasDisease | >2010 | KG_After2010 | Test data |
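A time-sliced split of this kind can be sketched as follows. The `Edge` type, its field names, and the example years are hypothetical; only the 2010 cutoff and the use of later gene-disease links as test data come from the table above.

```python
from typing import NamedTuple

class Edge(NamedTuple):
    node1: str
    interaction: str
    node2: str
    first_referenced: int  # year of first report (hypothetical field name)

def temporal_split(edges, cutoff_year=2010, target_relation="hasDisease"):
    """Split edges into a training graph and held-out test links.

    Edges first referenced up to and including cutoff_year form the training
    graph; links of target_relation first referenced later form the test set.
    """
    train = [e for e in edges if e.first_referenced <= cutoff_year]
    test = [
        e for e in edges
        if e.first_referenced > cutoff_year and e.interaction == target_relation
    ]
    return train, test

# Illustrative edges mirroring the three rows of the table (years are made up).
edges = [
    Edge("Gi", "hasDisease", "Dk", 2005),
    Edge("Gi", "activates", "Gj", 2008),
    Edge("Gj", "hasDisease", "Dk", 2014),
]
train_edges, test_edges = temporal_split(edges)
```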
FIGURE 2. Topological properties of KG_Before2010. (A) In- and out-degree of the nodes in each node type. Also shown is the total degree, defined as the sum of the in- and out-degree. All node types have hubs with over 100 edges (log2(101) ≈ 6.7). (B) PageRank centrality, by node type. (C) The probability of each node degree suggests a power law; both axes are log scaled. (D, E) Temporal evolution of gene-gene and gene-disease links between 1990 and 2021. Edges were filtered according to their first mention in the literature. (D) Evolution of the largest weakly connected component over time, in terms of node count, edge count, edge density and power law coefficient. (E) Details on the relative growth by node types (genes or diseases) and by edge types (gene-gene interactions and gene-disease annotations).
Types of inferencing and their overall performance scores based on a total of 5,176 reference gene-disease links reported after 2010. Average ± standard deviations are reported for the random predictions.
| Type of inferencing | Predicted links | Precision | Precision at 100 | Precision (random) | p-value | Recall | Recall (random) | p-value |
|---|---|---|---|---|---|---|---|---|
| All the inferences | 170,506 | 0.0296 | 0.23 | 0.0152 ± 0.0003 | 2.55E-284 | 0.9737 | 0.5449 ± 0.0223 | 1.50E-81 |
| One-hop and protein tissue | 33,633 | 0.0817 | 0.21 | 0.0227 ± 0.0006 | 0.00E+00 | 0.5307 | 0.2234 ± 0.0060 | 0.00E+00 |
| One-hop and RNA tissue | 45,664 | 0.0794 | 0.3 | 0.0227 ± 0.0006 | 0.00E+00 | 0.7007 | 0.2235 ± 0.0061 | 0.00E+00 |
| Two-hop and protein tissue | 120,319 | 0.0319 | 0.14 | 0.0158 ± 0.0003 | 0.00E+00 | 0.7417 | 0.5247 ± 0.0088 | 4.50E-127 |
| Two-hop and RNA tissue | 167,939 | 0.0295 | 0.23 | 0.0157 ± 0.0003 | 7.10E-286 | 0.9571 | 0.5286 ± 0.0088 | 0.00E+00 |
| One-hop without tissue | 47,734 | 0.0787 | 0.30 | 0.0227 ± 0.0006 | 0.00E+00 | 0.7262 | 0.2235 ± 0.0061 | 0.00E+00 |
| Two-hop without tissue | 174,305 | 0.0291 | 0.23 | 0.0157 ± 0.0003 | 7.10E-286 | 0.9795 | 0.5286 ± 0.0088 | 0.00E+00 |
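The precision, recall and precision-at-100 columns can be computed along the lines of the sketch below. It assumes predictions and reference links are (gene, disease) pairs; how the authors rank predictions for the top-100 cut is not stated in this excerpt, so `ranked_predictions` is simply taken to be ordered best first.

```python
def precision_recall(predicted, reference):
    """Precision and recall of a set of predicted links against reference links."""
    predicted, reference = set(predicted), set(reference)
    hits = len(predicted & reference)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall

def precision_at_k(ranked_predictions, reference, k=100):
    """Fraction of the k top-ranked predictions that appear in the reference set."""
    reference = set(reference)
    top = list(ranked_predictions)[:k]
    if not top:
        return 0.0
    return sum(1 for link in top if link in reference) / len(top)

# Toy data (hypothetical identifiers).
toy_reference = {("G1", "D1"), ("G3", "D2")}
toy_ranked = [("G1", "D1"), ("G2", "D1"), ("G3", "D2")]

toy_precision, toy_recall = precision_recall(toy_ranked[:2], toy_reference)
toy_p_at_2 = precision_at_k(toy_ranked, toy_reference, k=2)
```

The table's figures are consistent with these definitions: for "All the inferences", a recall of 0.9737 over 5,176 reference links corresponds to roughly 5,040 hits, and 5,040 hits out of 170,506 predicted links gives a precision of about 0.0296.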
FIGURE 3. Performance evaluation of the hop-based predictions. (A) Precision, Recall, F1 and Precision@100 metrics calculated from all the gene-disease links predicted by each hop-based approach. (B) ROC and Precision-Recall performance curves for all the hop-based prediction methods. (C) Area under the ROC (AUROC) and Precision-Recall (AUPRC) curves shown in (B).
Random Forest predictions on different embeddings.
| Category (with tissue) | Precision | Recall | F1 | Category (no tissue) | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|
| DistMult/Cos-sim/@all | 0.5339 | 0.1609 | 0.247278 | DistMult_notissue/Cos-sim/@all | 0.3747 | 0.0326 | 0.059981 |
| DistMult/Euclidean/@all | 0.6758 | 0.2413 | 0.355622 | DistMult_notissue/Euclidean/@all | 0.4152 | 0.0917 | 0.150222 |
| RDF2Vec/Cos-sim/@all | 0.4765 | 0.2057 | 0.287353 | RDF2Vec_notissue/Cos-sim/@all | 0.5711 | 0.1636 | 0.25434 |
| RDF2Vec/Euclidean/@all | 0.412 | 0.242 | 0.304905 | RDF2Vec_notissue/Euclidean/@all | 0.4074 | 0.1246 | 0.190835 |
| TransD/Cos-sim/@all | 0.7356 | 0.3827 | 0.503468 | TransD_notissue/Euclidean/@all | 0.6038 | 0.3066 | 0.40669 |
| TransD/Euclidean/@all | 0.5312 | 0.3462 | 0.419196 | TransD_notissue/Cos-sim/@all | 0.6794 | 0.1027 | 0.178428 |
| TransE/Cos-sim/@all |  |  |  | TransE_notissue/Cos-sim/@all | 0.6894 | 0.6049 | 0.644392 |
| TransE/Euclidean/@all | 0.6604 | 0.5085 | 0.57458 | TransE_notissue/Euclidean/@all | 0.5958 | 0.3098 | 0.407639 |
| TransH/Cos-sim/@all | 0.6922 | 0.5884 | 0.636093 | TransH_notissue/Euclidean/@all | 0.54 | 0.3021 | 0.387446 |
| TransH/Euclidean/@all | 0.6187 | 0.5818 | 0.599683 | TransH_notissue/Cos-sim/@all | 0.6601 | 0.6263 | 0.642756 |
Bold numbers show the highest performance.
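The F1 column is the harmonic mean of the precision and recall columns. The short check below reproduces two of the table's F1 values from their reported precision and recall; the function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall taken from the table above.
f1_distmult_cos = f1_score(0.5339, 0.1609)   # table reports 0.247278
f1_transe_eucl = f1_score(0.6604, 0.5085)    # table reports 0.57458
```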
FIGURE 4. Embedding-based gene-disease prediction evaluation. (A) Embedding performances in which gene-tissue links were included in the knowledge graph. (B) Embedding performances in which gene-tissue links were not included in the knowledge graph.
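The "Cos-sim" and "Euclidean" labels in the embedding results refer to two standard ways of comparing a gene vector with a disease vector in the latent space. A minimal pure-Python sketch of both measures is given below; how these values are combined with the Random Forest classifier is not described in this excerpt, and the vectors here are made up.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical 3-dimensional embeddings for one gene and one disease.
gene_vec = [0.2, 0.5, -0.1]
disease_vec = [0.4, 1.0, -0.2]

similarity = cosine_similarity(gene_vec, disease_vec)
distance = euclidean_distance(gene_vec, disease_vec)
```

Cosine similarity ignores vector length, so the two parallel toy vectors score close to 1 even though their Euclidean distance is non-zero, which is one reason the two measures can rank candidate links differently.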
FIGURE 5. Performance evaluation per disease. (A) Precision, recall and F1 metrics attained by each of the two best-performing hop-based prediction methods. (B) Same as (A) but for the top two embedding methods. Only the top 10 diseases by precision are shown. The numbers in parentheses indicate the total number of gene-disease links in the gold standard for that disease, the number of predicted gene-disease links and how many of those were positive, respectively.
FIGURE 6. Example prediction from the knowledge graph. (A) TGFB1 is connected to Mental Disorders via IL6. (B) Ubiquitin is connected to ALS via c-Jun. Both panels also show the number of alternative connections from the genes to the predicted disease via one-hop and two-hop links.