Joseph Mullen, Simon J Cockell, Peter Woollard, Anil Wipat.
Abstract
Drug development is increasing in cost while decreasing in productivity, and there is general acceptance that the current R&D paradigm needs to change. One alternative approach is drug repositioning. With target-based approaches used heavily in drug discovery, a systematic method for ranking gene-disease associations is increasingly necessary. Although methods already exist to collect, integrate and score these associations, they often do not reliably reflect expert knowledge. Furthermore, the amount of data available in all areas covered by bioinformatics is increasing dramatically year on year. It thus makes sense to move away from generalised, hypothesis-driven approaches to research towards one that allows the data to generate their own hypotheses. We introduce an integrated, data-driven approach to drug repositioning. We first apply a Bayesian statistics approach to rank 309,885 gene-disease associations using existing knowledge. Ranked associations are then integrated with other biological data to produce a semantically rich drug discovery network. Using this network, we show how our approach identifies diseases of the central nervous system (CNS) as an area of interest: few CNS disorders currently have marketed treatments compared with other therapeutic areas. We then systematically mine our network for semantic subgraphs that allow us to infer drug-disease relations that are not captured in the network. We identify and rank 275,934 drug-disease has_indication associations after filtering out those that are more likely to be side effects, and comment on the top-ranked associations in more detail. The dataset has been created in Neo4j and is available for download at https://bitbucket.org/ncl-intbio/genediseaserepositioning along with a Java implementation of the searching algorithm.
Year: 2016 PMID: 27196054 PMCID: PMC4873016 DOI: 10.1371/journal.pone.0155811
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Overview of approach to identify novel drug-disease (Dr-D) associations.
Gene-disease (G-D) associations from 10 sources are first integrated and ranked. These scored associations are then integrated with protein, gene, disease and drug data to give an integrated dataset. A therapeutic area of application is then identified before the dataset is mined for instances of a semantic subgraph whose mappings contain inferred Dr-D associations. Finally, any Dr-D associations that are likely side effects (SEs) are filtered out using the MeSH distance measure, before all ranked Dr-D associations are returned.
Data sources of gene-disease associations.
| Source | Version/Accessed | Type | #Associations | #Map MeSH | % Map MeSH |
|---|---|---|---|---|---|
| CTD | Jul_02_2015/Aug’15 | Curated | 24,346 | 23,813 | 97.8 |
| OMIM® | 18-08-2015/Aug’15 | Curated | 5,143 | 3,375 | 65.6 |
| Orphanet | 2015_07_31/Jul’15 | Curated | 6,094 | 1,744 | 28.6 |
| UniProtKB | 2015_08 | Curated | 4,679 | 3,203 | 68.5 |
| GWAS Catalog | 24_08_2015/Aug’15 | Experimental | 13,326 | 5,112 | 38.4 |
| BeFree | 24-Aug-2015/Aug’15 | Literature | 330,888 | 233,264 | 70.5 |
| GoF/LoF | -/Oct’15 | Literature | 4,793 | 3,459 | 72.2 |
| SemRep | 25/Feb’15 | Literature | 96,024 | 72,908 | 75.9 |
| MGD | 24_08_2015/Aug’15 | Predicted | 1,943 | 1,577 | 81.2 |
| RGD | 21_08_2015/Aug’15 | Predicted | 7,667 | 7,667 | 100 |
Data sources used for G-D associations. ‘Curated’ refers to manually curated associations, ‘Experimental’ refers to associations drawn directly from genetic experimental observations, ‘Literature’ refers to associations automatically mined from the literature and ‘Predicted’ refers to associations statistically inferred from animal models.
⋆ Not including 1,397 associations for which the molecular basis is unknown.
◇Threshold of 1e-7 was used.
⊲See S1 Article.
∘Extracted associations between gene and disease that were of the following predicates: AFFECTS; ASSOCIATED_WITH; AUGMENTS; CAUSES; PREDISPOSES; COEXISTS_WITH and NEG_ASSOCIATED_WITH as described in [8].
*Used the same parameters used by DisGeNET to extract predicted associations.
Fig 2. Comparison of gene-disease (G-D) sources.
(A) Percentage spread of G-D associations from each integrated data source across the 29 MeSH disease branches. (B) Boxplot showing the overlap of G-D associations between the ten data sources. x associations were picked at random (x = 1000) and their overlap with the other data sources was identified. Note: n = number of data sources checked; red diamonds show the mean, open circles are outliers and the thick horizontal black lines show the median.
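The sampling procedure described for panel (B) can be illustrated with a minimal Python sketch. It assumes each data source is held as a set of (gene, disease) pairs; the function name `sample_overlap`, the source names and the toy pairs are hypothetical and are not taken from the authors' implementation.

```python
import random

def sample_overlap(sources, source_name, x=1000, seed=0):
    """For x randomly chosen associations from one source, count how many
    of the other sources also contain each association."""
    rng = random.Random(seed)
    own = sorted(sources[source_name])           # sorted for reproducible sampling
    picked = rng.sample(own, min(x, len(own)))   # cap x at the source size
    others = [s for name, s in sources.items() if name != source_name]
    return [sum(assoc in other for other in others) for assoc in picked]

# Toy data (made up): (gene, MeSH disease ID) pairs per source.
sources = {
    "A": {("G1", "D1"), ("G2", "D2"), ("G3", "D3")},
    "B": {("G1", "D1"), ("G2", "D2")},
    "C": {("G1", "D1")},
}
counts = sample_overlap(sources, "A", x=3)
print(sorted(counts))
```

The per-association counts returned for each source are the kind of values a boxplot like panel (B) would summarise.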
Data sources, types, attributes and frequency used in integrated repositioning graph.
| Source | Version/Accessed | Node type | #Nodes | Relation type | #Relations | Attributes |
|---|---|---|---|---|---|---|
| UniProtKB | 2015_08 | Protein | 20,203 | - | - | UniProt UID; UniProt ID; Name |
| UniProtKB | 2015_08 | Gene | 19,744 | - | - | Entrez Gene Symbol; Entrez Gene ID |
| UniProtKB | 2015_08 | - | - | encoded_by | 19,903 | - |
| ORDO | 2/July’15 | Rare_Disease | 8,626 | - | - | Name; MESH; OMIM; UMLS |
| ORDO | 2/July’15 | - | - | part_of | 12,518 | - |
| ORDO | 2/July’15 | - | - | has_parent | 11,201 | - |
| MeSH | 2015/Aug’15 | Common_Disease | 11,735 | - | - | MeSH Header; MeSH; MeSH Tree |
| MeSH | 2015/Aug’15 | - | - | is_a | 23,829 | - |
| DrugBank | 4.3/July’15 | Small_Molecule | 7,469 | - | - | DBID; Name; Category; Group |
| DrugBank | 4.3/July’15 | - | - | binds_to | 14,250 | Action |
| ChEMBL | 20/Sep’15 | - | - | binds_to | 23,507 | Activity type; Activity value |
| ChEMBL | 20/Sep’15 | - | - | - | - | Drug mechanism |
| SIDER | 4/Aug’15 | - | - | has_indication | 4,488 | - |
| NDFRT | Aug’15 | - | - | has_indication | 4,396 | - |
| PREDICT | - | - | - | has_indication | 1,265 | - |
| CTD curated | - | - | - | has_indication | 18,540 | - |
| SIDER | 4/Aug’15 | - | - | has_side_effect | 67,934 | - |
| Scored G-D | - | - | - | involved_in | 309,885 | Association score; Directionality |
Data sources used in the creation of the repositioning dataset.
*Made up of 5,370 descriptor records and 6,365 supplementary records.
⋆532 drug activity types (including agonist and antagonist) were taken from ChEMBL and mapped to drugs in the dataset.
⊲Unique associations from the 16,306 integrated.
∘Unique associations from the 163,525 integrated.
◇3,459 G-D associations are annotated with the gene functionality resulting in a disease state, either loss-of-function (2,211) or gain-of-function (1,248).
Fig 3. Semantic subgraph used during mining of the integrated network.
The subgraph is the simplest schematic representation of the route from drug to disease in target-based drug repositioning. By identifying mappings of the subgraph in our integrated dataset we aim to infer the red has_indication relations. Mappings are scored using the values captured in the Activity value and Association score attributes (shown in green) found on the binds_to and the involved_in relations, respectively. Note: in mappings, ‘Disease’ can be either a Common_Disease or a Rare_Disease and a ‘Drug’ is an approved Small_Molecule.
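As a rough illustration of mining for mappings of such a subgraph, the Python sketch below walks Drug → Protein → Gene → Disease paths over in-memory edge lists and keeps the best score per inferred drug-disease pair. It is not the authors' Java/Neo4j implementation. Two things are assumptions made only for this example: the activity value is taken to be already normalised to [0, 1], and a mapping's score is taken to be the product of the activity value and the association score (the caption does not state how the two are combined). All identifiers and data are hypothetical.

```python
from collections import defaultdict

def infer_indications(binds_to, encoded_by, involved_in, known):
    """Enumerate Drug -> Protein -> Gene -> Disease paths and keep the best
    score per (drug, disease) pair that is not already a known indication."""
    genes_of = defaultdict(list)
    for protein, gene in encoded_by:
        genes_of[protein].append(gene)
    diseases_of = defaultdict(list)
    for gene, disease, assoc_score in involved_in:
        diseases_of[gene].append((disease, assoc_score))

    best = {}
    for drug, protein, activity in binds_to:
        for gene in genes_of[protein]:
            for disease, assoc_score in diseases_of[gene]:
                if (drug, disease) in known:
                    continue                      # only infer new relations
                score = activity * assoc_score    # ASSUMED combination rule
                key = (drug, disease)
                best[key] = max(best.get(key, 0.0), score)
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)

# Toy data (made up): activity values assumed pre-normalised to [0, 1].
binds_to = [("drugA", "P1", 0.9), ("drugA", "P2", 0.5), ("drugB", "P2", 0.8)]
encoded_by = [("P1", "G1"), ("P2", "G2")]
involved_in = [("G1", "dis1", 0.8), ("G2", "dis1", 0.6), ("G2", "dis2", 0.5)]
known = {("drugB", "dis2")}

ranked = infer_indications(binds_to, encoded_by, involved_in, known)
for (drug, disease), score in ranked:
    print(drug, disease, round(score, 3))
```

Keeping only the maximum score per pair is one possible design choice; summing or otherwise aggregating over all supporting mappings would be an equally plausible alternative.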
Fig 4. Identifying a therapeutic area of interest.
(A) Dark grey shows the number of diseases in each therapeutic area of the MeSH hierarchy. Light grey shows the number of those diseases that are not involved in any of the gene-disease associations captured in our network. Red shows the number of diseases that are involved in a gene-disease association as a percentage of the total number of diseases in that therapeutic area. (B) Dark grey shows the number of diseases in each therapeutic area of the MeSH hierarchy. Light grey shows the number of those diseases that currently do not have a small molecule treatment on the market. Red shows the number of diseases that do have a treatment on the market as a percentage of the total number of diseases in that therapeutic area. Note: please see S1 Table for disease area names.
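The per-area counts and percentages plotted in both panels reduce to a simple set computation, sketched here in Python. The function name `area_coverage`, the area codes used as keys and the disease identifiers are illustrative assumptions, and the numbers are made up rather than taken from the dataset.

```python
def area_coverage(diseases_by_area, covered):
    """Return {area: (total, not_covered, percent_covered)} where `covered`
    is the set of diseases with a G-D association (panel A) or with a
    marketed small-molecule treatment (panel B)."""
    out = {}
    for area, diseases in diseases_by_area.items():
        total = len(diseases)
        n_covered = len(diseases & covered)
        pct = 100.0 * n_covered / total if total else 0.0
        out[area] = (total, total - n_covered, pct)
    return out

# Toy data (made up): two therapeutic areas with four diseases each.
diseases_by_area = {
    "C04": {"d1", "d2", "d3", "d4"},
    "C10": {"d5", "d6", "d7", "d8"},
}
treated = {"d1", "d2", "d3", "d5"}

stats = area_coverage(diseases_by_area, treated)
least_treated = min(stats, key=lambda area: stats[area][2])
print(stats, least_treated)
```

Ranking areas by the covered percentage, as in the last two lines, is one way an under-served therapeutic area could be flagged.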
Number of mappings for each disease type and therapeutic area post filtering.
| Therapeutic area | All Diseases | Common Disease | Rare Disease |
|---|---|---|---|
| All | 275,934 (451,269) | 219,623 (369,124) | 56,311 (82,145) |
| | 55,875 (102,832) | 39,383 (73,501) | 16,492 (29,331) |
| | 54,635 (84,213) | 41,241 (66,536) | 13,394 (17,677) |
After applying filtering we were left with a set of mappings that inferred unique (no repeats) drug-disease associations. Numbers in brackets denote how many mappings inferred the unique associations.
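The relationship between the two numbers in each cell (unique associations, with supporting mappings in brackets) can be sketched as a simple grouping step in Python. The tuple layout `(drug, gene, disease)` and the function name `summarise_mappings` are assumptions made for this example.

```python
from collections import Counter

def summarise_mappings(mappings):
    """mappings: iterable of (drug, gene, disease) subgraph instances.
    Returns (unique drug-disease pairs, total mappings, per-pair support)."""
    support = Counter((drug, disease) for drug, _gene, disease in mappings)
    return len(support), sum(support.values()), support

# Toy data (made up): two mappings support the same drugA-dis1 association.
mappings = [
    ("drugA", "G1", "dis1"),
    ("drugA", "G2", "dis1"),
    ("drugB", "G2", "dis1"),
]
n_unique, n_mappings, support = summarise_mappings(mappings)
print(f"{n_unique} ({n_mappings})")
```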
Fig 5. Validating inferred has_indication associations.
All 18,889 has_indication associations captured in our integrated network were extracted and used to validate the ability of our approach to identify known has_indication associations. Note: for each disease category (ALL, C04 and C10), the set of known indications was pruned to include only those containing drugs included in the inferences made by our approach (totalling 17,883). Mapping was done using a Sim value of 0.633, which is equivalent to a distance of two nodes in the MeSH hierarchy.
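To make the "distance of two nodes" criterion concrete, here is a minimal Python sketch that measures the path length between two MeSH tree numbers and treats an inferred association as matching a known indication when the drug is identical and the diseases lie within that distance. It does not reproduce the Sim measure itself, whose formula is not given here; interpreting the distance as a count of edges through the closest shared ancestor is an assumption, and the function names and toy tree numbers are hypothetical.

```python
def tree_distance(a, b):
    """Number of edges between two MeSH tree numbers via their closest
    shared ancestor (a virtual root if they share no prefix)."""
    pa, pb = a.split("."), b.split(".")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)

def is_recovered(inferred, known, tree_numbers, max_distance=2):
    """True if some known (drug, disease) pair has the same drug and a
    disease within max_distance of the inferred disease."""
    drug, disease = inferred
    return any(
        k_drug == drug
        and min(
            tree_distance(a, b)
            for a in tree_numbers[disease]
            for b in tree_numbers[k_disease]
        ) <= max_distance
        for k_drug, k_disease in known
    )

# Toy data (made up): each disease may sit at several tree positions.
tree_numbers = {
    "disX": ["C10.228.140.300"],
    "disY": ["C10.228.140.490"],
    "disZ": ["C04.557"],
}
known = [("drugA", "disY"), ("drugB", "disZ")]
print(is_recovered(("drugA", "disX"), known, tree_numbers))
print(is_recovered(("drugB", "disX"), known, tree_numbers))
```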
Top 10 inferred associations involving unique neoplasm diseases.
| Drug | Gene | Disease | Type | Evidence | Score |
|---|---|---|---|---|---|
| Sunitinib | | Gastrointestinal Stromal Tumors | R | M (1.0) | 0.999 |
| Ponatinib | | Acute myeloid leukemia | R | A | 0.998 |
| Dasatinib | | Familial prostate cancer | R | C | 0.996 |
| Ethinyl Estradiol | | Breast Neoplasms | C | M (1.0) | 0.988 |
| Dasatinib | | Myelogenous, Chronic, BCR-ABL Positive | C | M (1.0) | 0.988 |
| Pazopanib | | Mastocytosis | R | - | 0.984 |
| Afatinib | | Stomach Neoplasms | C | - | 0.973 |
| Sunitinib | | Multiple endocrine neoplasia type 2B | R | - | 0.961 |
| Sunitinib | | Pheochromocytoma | C | C | 0.960 |
| Sunitinib | | Familial medullary thyroid carcinoma | R | P | 0.958 |
We present the 10 top-ranked inferred has_indication associations involving neoplasms. All ranked associations are available for download. A disease is classed as Rare (R) if it maps to ORDO and Common (C) if it is only in MeSH and not mappable to an ORDO concept. Evidence: M = maps to indications in dataset with Sim 0.66 or above; A = approved; C = clinical trial; and P = scientific paper.
Top 10 inferred associations involving unique diseases of the nervous system.
| Drug | Gene | Disease | Type | Evidence | Score |
|---|---|---|---|---|---|
| Nitrendipine | | Hypokalemic periodic paralysis | R | - | 0.999 |
| Clonazepam | | Juvenile myoclonic epilepsy | R | M (0.76) | 0.999 |
| Mifepristone | | Bulbospinal neuronopathy, X-linked recessive | C | - | 0.999 |
| Memantine | | Landau-Kleffner Syndrome | R | - | 0.996 |
| Bromocriptine | | Myoclonus-dystonia syndrome | R | - | 0.994 |
| Roflumilast | | Acrodysostosis | R | - | 0.991 |
| Lisinopril | | Alzheimer Disease | C | - | 0.991 |
| Roflumilast | | Stroke | C | | 0.987 |
| Clonazepam | | Epilepsy, Absence | C | M (1.0) | 0.991 |
| Triazolam* | | Generalized Epilepsy With Febrile Seizures Plus, Type 3 | C | - | 0.988 |
We present the 10 top-ranked inferred has_indication associations involving unique diseases of the central nervous system. All ranked associations are available for download. A disease is classed as Rare (R) if it maps to ORDO and Common (C) if it is only in MeSH and not mappable to an ORDO concept. Evidence: M = maps to indications in dataset with Sim 0.66 or above; A = approved; C = clinical trial; and P = scientific paper. (*This drug has been withdrawn in the UK due to the risk of psychiatric adverse drug reactions, but continues to be available in the U.S.)