Kevin McCoy, Sateesh Gudapati, Lawrence He, Elaina Horlander, David Kartchner, Soham Kulkarni, Nidhi Mehra, Jayant Prakash, Helena Thenot, Sri Vivek Vanga, Abigail Wagner, Brandon White, Cassie S Mitchell.
Abstract
Link prediction in artificial intelligence is used to identify missing links or to derive future relationships that can occur in complex networks. A link prediction model was developed using the complex heterogeneous biomedical knowledge graph SemNet to predict missing links in biomedical literature for drug discovery. A web application visualized knowledge graph embeddings and link prediction results using TransE-, ComplEx-, and RotatE-based methods. The link prediction model achieved up to 0.44 hits@10 on the entity prediction tasks. The recent outbreak of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the virus that causes COVID-19, served as a case study to demonstrate the efficacy of link prediction modeling for drug discovery. The link prediction algorithm guided the identification and ranking of repurposed drug candidates for SARS-CoV-2, primarily by text mining biomedical literature on previous coronaviruses, including SARS and Middle East respiratory syndrome (MERS). Repurposed drugs included potential primary SARS-CoV-2 treatments, adjunctive therapies, and therapeutics to treat side effects. The link prediction accuracy for nodes ranked highly for SARS coronavirus was 0.875, as calculated by human-in-the-loop validation on existing COVID-19-specific data sets. Highly ranked predicted drug classes include anti-inflammatories, nucleoside analogs, protease inhibitors, antimalarials, envelope proteins, and glycoproteins. Examples of highly ranked predicted links to SARS-CoV-2 include human leukocyte interferon, recombinant interferon-gamma, cyclosporine, antiviral therapy, zidovudine, chloroquine, vaccination, methotrexate, artemisinin, alkaloids, glycyrrhizic acid, quinine, flavonoids, amprenavir, suramin, complement system proteins, fluoroquinolones, bone marrow transplantation, albuterol, ciprofloxacin, quinolone antibacterial agents, and hydroxymethylglutaryl-CoA reductase inhibitors. Approximately 40% of identified drugs, such as edetic acid or biotin, were not previously connected to SARS. In summary, link prediction can effectively suggest repurposed drugs for emergent diseases.
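The hits@10 figure quoted in the abstract is the fraction of test queries for which the correct entity appears among the ten highest-ranked candidates. The following is a minimal sketch of that metric, assuming the 1-based rank of the true entity is already known for each query. The function name `hits_at_k` and the example ranks are illustrative and are not taken from the paper.

```python
def hits_at_k(ranks, k=10):
    """Fraction of queries whose true entity is ranked within the top k.

    `ranks` holds the 1-based rank of the correct entity for each query.
    """
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks for five test triples (not data from the paper).
example_ranks = [1, 4, 12, 250, 9]
score = hits_at_k(example_ranks, k=10)  # three of the five ranks are <= 10
```

A reported value of 0.44 would therefore mean that the true entity landed in the top ten for 44% of the evaluated queries.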
Keywords: COVID-19; SARS-CoV-2; coronavirus; literature review; machine learning; natural language processing; repurposed drugs; text mining
Year: 2021 PMID: 34073456 PMCID: PMC8230210 DOI: 10.3390/pharmaceutics13060794
Source DB: PubMed Journal: Pharmaceutics ISSN: 1999-4923 Impact factor: 6.525
Figure 1. Visualization of a subgraph of the SemNet knowledge graph.
Figure 2. Link prediction and its subtasks. For a given triple (h, r, t), (a) represents the relation prediction task and (b, c) represent the entity prediction task. Here, h is the head entity, t is the tail entity, and r is the relation.
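The two subtasks can be illustrated with a TransE-style score, in which a triple (h, r, t) is considered plausible when the vector h + r lies close to t. The sketch below is a toy illustration under that assumption, not the paper's implementation: the 2-D embeddings, entity names, and function names are all hypothetical.

```python
import math

# Toy 2-D embeddings; names and values are hypothetical, for illustration only.
entities = {
    "chloroquine": [1.0, 0.0],
    "sars_coronavirus": [1.0, 1.0],
    "albuterol": [-1.0, 0.5],
}
relations = {
    "treats": [0.0, 1.0],
    "part_of": [1.0, -1.0],
}


def transe_distance(h, r, t):
    """TransE distance ||h + r - t||_2; smaller means a more plausible triple."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def predict_tail(head, relation):
    """Entity prediction (h, r, ?): rank every other entity as a candidate tail."""
    h, r = entities[head], relations[relation]
    candidates = [e for e in entities if e != head]
    return sorted(candidates, key=lambda e: transe_distance(h, r, entities[e]))


def predict_relation(head, tail):
    """Relation prediction (h, ?, t): rank every relation for a fixed entity pair."""
    h, t = entities[head], entities[tail]
    return sorted(relations, key=lambda r: transe_distance(h, relations[r], t))
```

Head prediction (?, r, t) follows the same pattern with the roles of head and tail swapped. ComplEx and RotatE replace the distance function with their own scoring functions but are ranked in the same way.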
Figure 3. The link prediction pipeline and its three main stages: triple extraction, model training, and model deployment.
Figure 4. Steps involved in the knowledge graph construction stage.
Figure 5. Distribution of the most prevalent node types in SemNet [16]. “Rest of node types” represents the aggregate of the remaining node types not individually listed in the figure due to space constraints.
Figure 6. Distribution of the different relation types in SemNet [16]. “Rest of relation types” represents the aggregate of the remaining relation types not listed in the figure due to space constraints.
Statistics of the SemNet-COVID training data.
| Properties | Statistics |
|---|---|
| Entities | 74,086 |
| Triples | 8,928,797 |
| Entity types | 121 |
| Relation types | 61 |
| Appended relation types | 25,341 |
| Training triples | 8,828,797 |
| Validation triples | 50,000 |
| Test triples | 50,000 |
Figure 7. Entity embeddings (TransE) of the 25 most frequent entity groups.
Figure 8. The end-to-end process of ranking link prediction results for a given query.
Example results from the Embedding API with different input queries. The input parameters are (entity: as shown in the table, size: 5, method: Complex).
| Query: SARS Coronavirus | Query: Chloroquine | Query: Cyclosporine |
|---|---|---|
| Middle East Respiratory Syndrome Coronavirus | Hydroxychloroquine | Calcineurin |
| Genus: Coronavirus | Polymyxin B Sulfate | rituximab |
| SARS coronavirus Urbani | Aminoquinoline | infliximab |
| Beluga Whale coronavirus SW1 | Bauxite | Calcitonin Receptor |
| SARS-related bat coronavirus | Mefloquine | Neoral |
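A lookup of this kind can be pictured as a nearest-neighbor search in embedding space. The source does not state which similarity measure the Embedding API uses, so the sketch below assumes cosine similarity. The toy 3-D vectors and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_entities(query, embeddings, size=5):
    """Return the `size` entities whose embeddings are most similar to `query`."""
    q = embeddings[query]
    others = [name for name in embeddings if name != query]
    others.sort(key=lambda name: cosine_similarity(q, embeddings[name]), reverse=True)
    return others[:size]


# Toy 3-D vectors; the values are made up and are not SemNet embeddings.
toy_embeddings = {
    "chloroquine": [0.9, 0.1, 0.0],
    "hydroxychloroquine": [0.8, 0.2, 0.0],
    "mefloquine": [0.7, 0.1, 0.3],
    "calcineurin": [0.0, 0.1, 0.9],
}
neighbors = nearest_entities("chloroquine", toy_embeddings, size=2)
```

Because the ranking depends only on geometric closeness, semantically unrelated entities can still surface when their embeddings happen to lie nearby, as some rows in the table suggest.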
Example results from the Relation prediction API with 5 different queries.
| Query | Top Ranked Relation |
|---|---|
| (Chloroquine, ?, SARS Coronavirus) | treats |
| (SARS Coronavirus, ?, Chloroquine) | location_of |
| (Dexamethasone, ?, SARS Coronavirus) | treats |
| (Albuterol, ?, Dornase Alfa) | uses |
| (Nucleocapsid, ?, SARS Coronavirus) | part_of |
Example results from the Entity prediction API with two different queries. The first column shows the head entity predictions for the query (entity: SARS coronavirus, relation: treats, size: 5, method: Ensemble, is_head: False). The second column shows the tail entity predictions for the query (entity: Chloroquine, relation: treats, size: 5, method: Ensemble, is_head: True).
| Query: (?, treats, SARS Coronavirus) | Query: (Chloroquine, treats, ?) |
|---|---|
| Chloroquine | Ebola Virus |
| Duration | Virus |
| Glycyrrhizic Acid | SARS coronavirus |
| Ritonavir | Zika Virus |
| Octanoic acid | HIV |
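The source does not describe how the Ensemble method combines the individual models. One common option, shown here purely as an assumption, is to average each candidate's rank across the per-model rankings and sort by that mean. The function name and the three toy rankings are hypothetical.

```python
def mean_rank_ensemble(rankings, size=5):
    """Combine several ranked candidate lists by average 1-based rank."""
    candidates = set().union(*rankings)

    def mean_rank(candidate):
        total = 0
        for ranking in rankings:
            if candidate in ranking:
                total += ranking.index(candidate) + 1
            else:
                # A candidate missing from a list is placed just below its last entry.
                total += len(ranking) + 1
        return total / len(rankings)

    # Ties are broken alphabetically so the output is deterministic.
    return sorted(candidates, key=lambda c: (mean_rank(c), c))[:size]


# Hypothetical per-model rankings for the query (?, treats, SARS Coronavirus).
transe_ranking = ["chloroquine", "ritonavir", "duration"]
complex_ranking = ["ritonavir", "chloroquine", "glycyrrhizic acid"]
rotate_ranking = ["chloroquine", "glycyrrhizic acid", "ritonavir"]

combined = mean_rank_ensemble([transe_ranking, complex_ranking, rotate_ranking], size=3)
```

Rank averaging avoids comparing raw scores across models whose scoring functions are on different scales, which is why it is a frequent default for this kind of aggregation.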
The top 5 nodes for each link prediction validation label, ranked by lowest HeteSim score. Lower standardized HeteSim scores correspond to a stronger relationship. The complete list of top-ranked nodes (n = 180, equivalent to the top 20 ranked nodes for each node type) can be found in Supplementary Table S1.
| Node | Drug Class | Node Type | Standardized HeteSim Score | Predicted Link | Pharmacokinetics |
|---|---|---|---|---|---|
| Top 5 PROVEN Nodes | | | | | |
| Chloroquine | glycoproteins big | OrganicChemical | 0.073 | treats | Primary |
| Glycyrrhizic Acid | anti-inflammatory | OrganicChemical | 0.074 | treats | Primary |
| Quinine | anti-inflammatory | OrganicChemical | 0.077 | treats | Primary |
| Chloroquine | antimalarial | OrganicChemical | 0.077 | treats | Primary |
| Fluoroquinolones | antimalarial | OrganicChemical | 0.077 | treats | Adjunctive |
| Top 5 DISPROVEN Nodes | | | | | |
| Polyamines | antimalarial | OrganicChemical | 0.080 | treats | Other |
| Complement System Proteins | neuraminidase inhibitors | ImmunologicFactor | 0.101 | prevents | Other |
| Dopamine Receptor | neuraminidase inhibitors | Receptor | 0.103 | treats | Other |
| Chemokine (C-C Motif) Receptor 5\|CCR5 | envelope protein | Receptor | 0.104 | treats | Other |
| Antiviral prophylaxis | nucleoside analogs | TherapeuticOrPreventiveProcedure | 0.107 | neg treats | Primary |
| Top 5 UNCLEAR Nodes | | | | | |
| small molecule | immunomodulators | OrganicChemical | 0.069 | prevents | N/A |
| ebselen | anti-inflammatory | OrganicChemical | 0.077 | treats | N/A |
| Fluticasone propionate | anti-inflammatory | OrganicChemical | 0.079 | treats | N/A |
| Quinolone Antibacterial Agents | anti-inflammatory | OrganicChemical | 0.082 | prevents | N/A |
| Morphine | anti-inflammatory | OrganicChemical | 0.084 | treats | N/A |
| Top 5 MISSING Nodes | | | | | |
| small molecule | glycoproteins big | OrganicChemical | 0.056 | prevents | N/A |
| RABBIT SERUM | glycoproteins big | OrganicChemical | 0.061 | N/A | N/A |
| Esters | anti-inflammatory | OrganicChemical | 0.070 | N/A | N/A |
| Edetic Acid | glycoproteins big | OrganicChemical | 0.074 | treats | N/A |
| small molecule | protease inhibitors | OrganicChemical | 0.075 | N/A | N/A |
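The selection described in the caption (the top 20 nodes per node type, with lower standardized HeteSim scores ranked first, giving n = 180 across nine node types) amounts to a grouped top-k. The sketch below uses a handful of rows from the table above with k = 2. The function name and data layout are assumptions, not the authors' code.

```python
from collections import defaultdict


def top_k_per_type(nodes, k):
    """Keep the k lowest-scoring nodes within each node type.

    `nodes` is an iterable of (name, node_type, standardized_hetesim_score).
    """
    by_type = defaultdict(list)
    for name, node_type, score in nodes:
        by_type[node_type].append((score, name))
    # Lower score means a stronger relationship, so sort ascending.
    return {
        node_type: [name for _, name in sorted(items)[:k]]
        for node_type, items in by_type.items()
    }


# A few rows copied from the table above.
rows = [
    ("Chloroquine", "OrganicChemical", 0.073),
    ("Glycyrrhizic Acid", "OrganicChemical", 0.074),
    ("Quinine", "OrganicChemical", 0.077),
    ("Complement System Proteins", "ImmunologicFactor", 0.101),
    ("Dopamine Receptor", "Receptor", 0.103),
    ("Chemokine (C-C Motif) Receptor 5|CCR5", "Receptor", 0.104),
]
selected = top_k_per_type(rows, k=2)
```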
Figure 9. (A) Pie chart illustrating the composition of the COVID-19 case study dataset by link prediction evaluation. (B) Violin plot showing the distribution of standardized HeteSim scores for each link prediction evaluation. A lower HeteSim score means a closer relationship between the source node and the tail node. (C) Confusion matrix for the link prediction in the COVID-19 case study. “MISSING” and “UNCLEAR” nodes were left out because the true relationship is unknown. Sensitivity = 0.975, specificity = 0.375.
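Sensitivity, specificity, and accuracy all derive from the four cells of a confusion matrix. The source reports only the ratios, so the cell counts in the sketch below are hypothetical. They were chosen as one set of integers that reproduces the reported sensitivity (0.975), specificity (0.375), and the accuracy of 0.875 given in the abstract, and they should not be read as the paper's actual counts.

```python
def confusion_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Hypothetical counts (NOT from the source), consistent with the reported ratios.
sens, spec, acc = confusion_metrics(tp=39, fn=1, tn=3, fp=5)
```

The combination of high sensitivity and low specificity indicates that the predictions rarely miss a proven link but also label most disproven links as positive.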
Figure 10. (A) Violin plot showing the distribution of standardized HeteSim scores for each pharmacokinetic label. A lower HeteSim score means a closer relationship between the source node and the tail node. (B) Violin plot showing the distribution of standardized HeteSim scores for each node type. (C) Violin plot showing the distribution of standardized HeteSim scores for each drug class.