Bethany Percha, Russ B Altman.
Abstract
The published biomedical research literature encompasses most of our understanding of how drugs interact with gene products to produce physiological responses (phenotypes). Unfortunately, this information is distributed throughout the unstructured text of over 23 million articles. The creation of structured resources that catalog the relationships between drugs and genes would accelerate the translation of basic molecular knowledge into discoveries of genomic biomarkers for drug response and prediction of unexpected drug-drug interactions. Extracting these relationships from natural language sentences on such a large scale, however, requires text mining algorithms that can recognize when different-looking statements are expressing similar ideas. Here we describe a novel algorithm, Ensemble Biclustering for Classification (EBC), that learns the structure of biomedical relationships automatically from text, overcoming differences in word choice and sentence structure. We validate EBC's performance against manually-curated sets of (1) pharmacogenomic relationships from PharmGKB and (2) drug-target relationships from DrugBank, and use it to discover new drug-gene relationships for both knowledge bases. We then apply EBC to map the complete universe of drug-gene relationships based on their descriptions in Medline, revealing unexpected structure that challenges current notions about how these relationships are expressed in text. For instance, we learn that newer experimental findings are described in consistently different ways than established knowledge, and that seemingly pure classes of relationships can exhibit interesting chimeric structure. The EBC algorithm is flexible and adaptable to a wide range of problems in biomedical text mining.
Year: 2015 PMID: 26219079 PMCID: PMC4517797 DOI: 10.1371/journal.pcbi.1004216
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Fig 1. Example of a dependency graph for a Medline 2013 sentence.
(a) The raw sentence. (b) The complete dependency graph for the sentence. (c) The dependency path connecting the gene CYP3A4 with the drug rifampicin. (d) A more compact representation of the dependency path.
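The idea of a dependency path between two entities can be sketched as a shortest-path search over the dependency graph. The snippet below is a minimal illustration, not the authors' pipeline: the toy sentence ("Rifampicin strongly induces CYP3A4"), its edges, and the function name `dependency_path` are all hypothetical, and a real parser would supply the edges.

```python
from collections import deque


def dependency_path(edges, start, goal):
    """Breadth-first search for the shortest path between two tokens.

    edges: list of (head, label, dependent) triples from a dependency parse.
    Returns a list of (from_token, label, direction, to_token) steps,
    or None when the two tokens are not connected.
    """
    graph = {}
    for head, label, dep in edges:
        # Store each edge in both directions so the search can walk
        # up toward a head as well as down toward a dependent.
        graph.setdefault(head, []).append((dep, label, "down"))
        graph.setdefault(dep, []).append((head, label, "up"))

    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for nxt, label, direction in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(node, label, direction, nxt)]))
    return None


# Hypothetical parse of "Rifampicin strongly induces CYP3A4".
toy_edges = [
    ("induces", "nsubj", "rifampicin"),
    ("induces", "dobj", "CYP3A4"),
    ("induces", "advmod", "strongly"),
]

path = dependency_path(toy_edges, "CYP3A4", "rifampicin")
labels = [step[1] for step in path]
```

Here `labels` holds the edge labels along the path from the gene to the drug; the modifier "strongly" lies off the path and is ignored, which is how a path abstracts away from incidental wording.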
Selected dependency paths and representative sentences.
| No. | Dependency path | Example sentence (PubMed ID) | Frequency |
|---|---|---|---|
| 1 | [ | | 1181 |
| 2 | [ | The mNQO activity was insensitive to | 452 |
| 3 | [ | The recommended therapy for stage III disease, based on clinical trials and by the Israeli Ministry of Health for 2006, includes | 338 |
| 4 | [ | | 204 |
| 5 | [ | | 118 |
| 6 | [ | | 73 |
| 7 | [ | | 71 |
| 8 | [ | | 64 |
| 9 | [ | | 57 |
| 10 | [ | These results suggest that | 53 |
| 11 | [ | | 51 |
| 12 | [ | Tadalafil is mainly metabolized by cytochrome P450 (CYP) 3A4, and as | 30 |
| 13 | [ | When cells were cultured in a medium containing estrogen, | 29 |
| 14 | [ | The results of preclinical studies demonstrated that | 21 |
| 15 | [ | | 17 |
The drug and gene names flanking each path are bolded. Some key abbreviations are listed here: appos: appositional modifier, amod: adjectival modifier, prep: prepositional modifier (if prep_of, the specific preposition used is “of”, if prep_to, it’s “to”, if prep_for, it’s “for”), nsubjpass: passive nominal subject, agent: complement of passive verb, dobj: direct object of active verb, nsubj: noun subject of active verb.
Summary of datasets for the PGx and drug-target relation extraction tasks.
In the dense dataset, the drug-gene pairs and dependency paths represented must have occurred at least five times in Medline. In the sparse dataset, the dependency paths must have occurred at least twice, and all drug-gene pairs connected by these paths were included, even if they only occurred once.
| Dataset | Task | Drug-gene pairs | Dependency paths | Nonzero matrix elements (sparsity) | Known relationships in dataset | Optimal row and column cluster numbers |
|---|---|---|---|---|---|---|
| Dense | PGx | 3514 | 1232 | 10,007 (99.8%) | 290 | |
| | Drug-target | | | | 410 | |
| Sparse | PGx | 14,052 | 7272 | 29,456 (99.97%) | 545 | |
| | Drug-target | | | | 779 | |
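The sparsity figures in the table follow directly from the matrix dimensions and the number of nonzero elements. As a quick check, assuming sparsity means the percentage of zero entries in the pairs-by-paths matrix (the helper name `sparsity_percent` is ours):

```python
def sparsity_percent(n_rows, n_cols, n_nonzero):
    """Percentage of matrix entries that are zero."""
    return 100.0 * (1.0 - n_nonzero / (n_rows * n_cols))


# Dense dataset: 3514 drug-gene pairs x 1232 dependency paths.
dense = sparsity_percent(3514, 1232, 10007)

# Sparse dataset: 14,052 drug-gene pairs x 7272 dependency paths.
sparse = sparsity_percent(14052, 7272, 29456)

print(f"dense:  {dense:.1f}%")
print(f"sparse: {sparse:.2f}%")
```

Rounded to the precision used in the table, these reproduce the reported 99.8% and 99.97%.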
Fig 2. Classifier performance at the task of recognizing (a) PGx associations (dense matrix), (b) drug-target associations (dense matrix), (c) PGx associations (sparse matrix) and (d) drug-target associations (sparse matrix).
Fig 3. Example of ITCC output for a small matrix consisting of drug-CYP3A4 pairs and their associated dependency paths.
The top heatmap shows the original data after the clustering was performed. An orange square represents an observed path (column) between a given drug-gene pair (row). The bottom heatmap shows the approximate distribution arising from a single ITCC run.
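To make the "approximate distribution" concrete, the sketch below assumes that ITCC refers to information-theoretic co-clustering, whose standard approximation is q(x, y) = p(X, Y) p(x | X) p(y | Y) for row cluster X and column cluster Y. It only evaluates that formula for fixed, hand-picked cluster assignments; it does not perform the clustering itself, and the toy matrix and function name are hypothetical.

```python
def itcc_approximation(counts, row_clusters, col_clusters):
    """Evaluate q(x, y) = p(X, Y) * p(x | X) * p(y | Y) for fixed clusters.

    counts: list of rows (drug-gene pairs) by columns (dependency paths).
    row_clusters / col_clusters: cluster label for each row / column.
    """
    total = float(sum(sum(row) for row in counts))
    n_rows, n_cols = len(counts), len(counts[0])
    p = [[counts[i][j] / total for j in range(n_cols)] for i in range(n_rows)]

    # Marginal probabilities of individual rows and columns.
    p_row = [sum(p[i]) for i in range(n_rows)]
    p_col = [sum(p[i][j] for i in range(n_rows)) for j in range(n_cols)]

    # Probability mass of each row cluster, column cluster, and block.
    row_mass, col_mass, block_mass = {}, {}, {}
    for i in range(n_rows):
        row_mass[row_clusters[i]] = row_mass.get(row_clusters[i], 0.0) + p_row[i]
    for j in range(n_cols):
        col_mass[col_clusters[j]] = col_mass.get(col_clusters[j], 0.0) + p_col[j]
    for i in range(n_rows):
        for j in range(n_cols):
            key = (row_clusters[i], col_clusters[j])
            block_mass[key] = block_mass.get(key, 0.0) + p[i][j]

    q = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            r, c = row_clusters[i], col_clusters[j]
            if row_mass[r] > 0 and col_mass[c] > 0:
                q[i][j] = (block_mass[(r, c)]
                           * (p_row[i] / row_mass[r])
                           * (p_col[j] / col_mass[c]))
    return q


# Hypothetical 4 x 4 matrix with two row clusters and two column clusters.
toy = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
q = itcc_approximation(toy, [0, 0, 1, 1], [0, 0, 1, 1])
```

In this toy case the cell `q[1][1]` is positive even though the pair was never observed with that path, because both belong to a block that carries probability mass. Cells in empty blocks, such as `q[0][2]`, stay at zero.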
Some dependency paths that cluster together with relatively high frequency.
| First Pattern | Second Pattern | Frequency of co-clustering |
|---|---|---|
| [ | [ | 0.59 |
| [ | [ | 0.31 |
| [ | [ | 0.12 |
| [ | [ | 0.11 |
| [ | [ | 0.07 |
| [ | [ | 0.03 |
The first line of each row shows the dependency path, the second an example of what that path would look like in the raw text. The symbol D represents the drug and G represents the gene.
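One plausible reading of "frequency of co-clustering" is the fraction of runs in an ensemble in which two dependency paths land in the same column cluster. The sketch below works under that assumption; the source does not spell out the exact computation, and the run data and path names are made up for illustration.

```python
from itertools import combinations


def coclustering_frequency(runs):
    """Fraction of runs in which each pair of items shares a cluster.

    runs: list of dicts, one per clustering run, mapping item -> cluster label.
    Labels only need to be comparable within a run, not across runs.
    """
    items = sorted(runs[0])
    freq = {}
    for a, b in combinations(items, 2):
        together = sum(1 for run in runs if run[a] == run[b])
        freq[(a, b)] = together / len(runs)
    return freq


# Four hypothetical runs over three dependency paths.
toy_runs = [
    {"path_A": 0, "path_B": 0, "path_C": 1},
    {"path_A": 2, "path_B": 2, "path_C": 2},
    {"path_A": 1, "path_B": 0, "path_C": 0},
    {"path_A": 0, "path_B": 0, "path_C": 1},
]
freq = coclustering_frequency(toy_runs)
```

Because cluster labels are arbitrary within each run, only agreement within a run is counted, which is what makes the measure usable across an ensemble of independent clusterings.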
Fig 4. Dendrogram illustrating the semantic relationships among 3514 drug-gene pairs.
In this dendrogram, the leaves represent 3514 drug-gene pairs that co-occur in Medline sentences at least 5 times, and we have cut the dendrogram at various levels (illustrated by the red lines in the interior of the dendrogram) to produce the colored clusters shown around the edges. Drug-gene pairs that are known drug-target relationships from DrugBank are denoted by blue dots, and those that are known PGx relationships from PharmGKB are denoted by orange dots. The heights of the turquoise bars are proportional to how often the corresponding drug-gene pairs co-occur in Medline sentences (a proxy for how well-documented they are).
Explanation of the clusters shown in Fig 4.
Clusters with 20 or fewer members are not described in the table in the interest of space.
| Theme | Cluster size | Key word/phrase | Example drug-gene pair | % PGx | % Drug-Target | Comment |
|---|---|---|---|---|---|---|
| Synthesis | 34 | synthase | aldosterone, P450aldo | 0.0 | 17.6 | Many of the drugs in this cluster are endogenous compounds. |
| Activation | 134 | increased activity | curcumin, caspase-8 | 9.0 | 6.7 | In this cluster, activation is frequently associated with phosphorylation. |
| Enzyme activity | 45 | activity | estradiol, E2DH | 6.7 | 6.7 | The gene is typically an enzyme that chemically modifies the drug. A few transporter pairs are also present, such as (ornithine, ORNT1). |
| Substrates | 64 | substrate | aminopterin, hOAT1 | 29.7 | 7.8 | Relatively few mentions of “metabolism” compared to 3b and 3c. Reference to transporters such as P-gp, hOAT1, SERT. |
| Metabolism | 131 | metabolized | rosiglitazone, CYP2C8 | 37.4 | 0.8 | Frequent reference to liver cytochromes such as CYP3A4 and CYP2D6. |
| Substrates that (often) also affect activity | 70 | substrate | efavirenz, CYP2B6 | 37.1 | 5.7 | The drug-gene pairs in this sub-cluster are mentioned together less frequently in the literature than those in 3a or 3b. |
| Third party involvement | 28 | Inhibits… to/by | rapamycin, PHAS-I | 3.6 | 3.6 | All of the drug-gene pairs in this cluster are connected by exactly one path, and the paths are unusual. They often refer to the involvement of a third molecule of some kind, raising the possibility of three-way interactions among drugs and genes. |
| Coadministration | 172 | in presence of | sunitinib, IFN-alpha | 0.6 | 0.6 | This cluster illustrates the blurry line between drugs and genes (proteins) since many drugs (in this case, IFN-alpha) are also proteins. |
| Increased production | 141 | induced, production, increase | PGE2, VEGF | 1.4 | 1.4 | Cluster 7 is distinguished by the presence of many proteins that act as drugs. These include IL-2, gp120, and PGE2. |
| Raised levels | 52 | levels, production | cisplatin, Rad51 | 5.8 | 3.8 | Similar in theme to 7a-c, descriptions from this cluster involve drugs that raise protein levels. Sentences mostly report experimental results. |
| Antagonists | 101 | antagonist, blocker | plerixafor, CXCR4 | 11.9 | 39.6 | Cluster 8 references inhibition more generally. EBC learns that antagonism (cluster 8a) is a subclass of inhibition. |
| Inhibition | 380 | inhibitor of, inhibits | sildenafil, PDE5 | 18.7 | 37.9 | Cluster 8c is large and includes some interesting smaller subclusters, such as antibodies against particular proteins, and inhibition, specifically, of protein activity or phosphorylation. |
| Specific drug-protein interactions | 56 | target, kinase, protein | hyaluronate, GHAP | 3.6 | 14.3 | These are pairs where the protein is named for its function, which involves a particular action on the drug in question. In the second sentence, the pair is pyridoxal/Pdxk. |
| Inhibitors | 70 | inhibitor, substrate, metabolized | verapamil, P-gp | 30.0 | 4.3 | Many drugs act as both inhibitors and substrates of proteins, including ritonavir/CYP3A4, quinidine/P-gp, and omeprazole/CYP2C19, all found in cluster 10. |
| Inhibition | 148 | inhibitor of; | miglitol, alpha-glucosidase | 12.2 | 27.0 | There is little difference in meaning between this cluster and cluster 8c, except that there are variations in phrasing that are more common to one or the other cluster. |
| Receptors | 80 | receptor(s), gene, antagonist | urokinase, uPAR | 1.3 | 32.5 | Cluster 12 contains a subcluster primarily composed of antagonist pairs, and a larger subcluster involving pairs where the gene is described as the “receptor” for the drug. |
| Activation | 112 | activated, increased expression | simvastatin, Rac1 | 0.0 | 0.0 | This is the largest cluster with zero representation of either PGx or drug-target relationships. The pair in the second sentence is estradiol/HO-1. |
| Agonists | 129 | agonist, hormone, analog | sumatriptan, 5-HT1B | 7.0 | 33.3 | |
| Activation / stimulation | 138 | activates, induced, stimulates | resveratrol, AMPK | 1.4 | 4.3 | Focus is similar to cluster 13 but notably, there is relatively little reference to expression. |
| Protein binding | 28 | binds to; binding to | glibenclamide, SUR1 | 7.1 | 35.7 | |
| Experimental methods | 151 | treatment, concentration, toxicities, mice, cells | dasatinib, STAT3 | 1.8 | 2.4 | This cluster includes many sentences describing observed effects on expression/activity, but not as many as other nearby clusters. Cluster 17d is also home to one insidious error: the term “DLTS” (“dose-limiting toxicities”) identified as a gene. |
| Effect on expression | 148 | investigate effect on | colchicine, MEFV | 1.3 | 0.0 | If directionality of effect is reported in cluster 17e, it is most often inhibition. |
| Induction of expression | 123 | increased/induced expression | imatinib, CXCR4 | 1.6 | 1.6 | Typically experimental results reporting a positive effect of the drug on gene expression. |
| Effect on expression, usually induction | 65 | by expression, inducer of, was induced by | melatonin, bcl-2 | 1.5 | 1.5 | In many sentences, we know only that the effect of the drug on the expression of the gene was investigated. If directionality of effect is reported, it is most often induction. |
| Inhibition of activation | 41 | inhibited / suppressed activation (of | fluvastatin, NF-kappaB | 4.9 | 4.9 | This is another set of three-way interactions where the drug is suppressing activation of the protein by some other molecule. |
| Effect on expression, usually inhibition | 54 | expression by, expression of, inhibited expression, decreased, reduced | montelukast, iNOS | 0.0 | 3.7 | There is a fairly even split in this cluster between methods and results. |
| Decreased levels | 59 | decreased levels, inhibited expression, suppression | gefitinib, Rad51 | 1.7 | 0.0 | Note that the example sentence here is identical to that in cluster 7d, but the drug in question is different. This single sentence describes two separate relationships with different characters. |
| Inhibited activity / expression | 76 | inhibited activity, inhibited expression | minocycline, MMP-2 | 3.9 | 10.5 | Focus is experimental observations, as opposed to stated prior knowledge (the dominant theme in cluster 8c). |
| Inhibition | 78 | inhibited; | trastuzumab, HER2 | 10.3 | 17.9 | There are some subtle differences between cluster 22 and cluster 8. Most notably, cluster 22 never references antagonism. Cluster 22 also contains some descriptions that never occur in cluster 8, such as “inhibited induction of” and “inhibited activation”. Similarly, cluster 8 contains some descriptions (besides those of antagonists) that never occur in cluster 22, such as “inhibitors of |
| Protein binding (and) affects activity | 33 | activity, protein, binds | gp120, DC-SIGN | 0.0 | 12.1 | This small cluster actually contains two smaller subclusters, one of which focuses on protein activity and the other on binding. The descriptions of these drug-gene pairs include some different variants of those in clusters 15 and 25f. |
| Patients with disease (error) | 92 | treatment, patients, disease | glyburide, NIDDM | 3.3 | 2.2 | This cluster illustrates one problem associated with using simple string matching to lexicons to identify drugs and genes: COPD and NIDDM are both gene names. Notably, however, these types of errors are “quarantined” together in the dendrogram. |
| Affects secretion / release | 50 | secretion | octreotide, calcitonin | 0.0 | 0.0 | Genes (proteins) in this cluster are generally hormones or cytokines, such as gastrin, lactogen, IL-1RA, and IL-13. |
| Expression | 252 | on expression, by expression, inhibited / increased expression | indomethacin, MCP-1 | 2.0 | 2.0 | The directionality of the drug's effect on expression varied within this cluster. The sentences mostly report experimental findings. |
| Affects activity | 38 | activity, on activity | amitriptyline, EAAT3 | 2.6 | 10.5 | |
Fig 5. Dendrogram illustrating predictions of novel PGx and drug-target relationships among 3514 drug-gene pairs.
The height of the bars corresponds to EBC's certainty that the pair in question represents a relationship of the corresponding type (orange: PGx relationships, blue: drug-target relationships). The dots represent known PGx and drug-target relationships, as in Fig 4.
Top 20 predictions of new drug-gene relationships for PharmGKB, and whether a PGx relationship has been documented in the literature.
| Rank | Candidate drug-gene pair | Relative certainty | Literature reference (PMID) | Comment |
|---|---|---|---|---|
| 1 | omeprazole, CYP2C19 | 1.000 | 11069321 | Individual polymorphisms of CYP2C19 already associated with omeprazole in PharmGKB. |
| 2 | mexiletine, CYP1A2 | 0.995 | 9690950 | |
| 3 | fentanyl, P-gp | 0.994 | 17192767 | |
| 4 | voriconazole, CYP3A4 | 0.986 | 17433262 | |
| 5 | cyclosporine, CYP3A4 | 0.983 | 18978522 | Association listed in PharmGKB as “ambiguous”. |
| 6 | duloxetine, CYP1A2 | 0.983 | 18307373 | |
| 7 | fluconazole, UGT2B7 | 0.982 | 16542204 | |
| 8 | montelukast, CYP2C8 | 0.973 | 21838784 | |
| 9 | dydrogesterone, AKR1C1 | 0.968 | 20727920 | |
| 10 | voriconazole, CYP2C9 | 0.966 | 16940139 | |
| 11 | imipramine, FMO1 | 0.962 | 19262426 | Experiment conducted in mice. |
| 12 | ticlopidine, CYP2C19 | 0.961 | 21178986 | |
| 13 | moclobemide, MAO-B | 0.960 | 7586937 | In this article, MAO-B activity was studied in relation to moclobemide response, but specific polymorphisms were not investigated. |
| 14 | ritonavir, P-gp | 0.958 | 16184031 | Association listed in PharmGKB as “ambiguous”. |
| 15 | cyclosporin, MDR1 | 0.955 | 15116055 | |
| 16 | cyclosporin, P-gp | 0.952 | 15116055 | Same gene as 15. |
| 17 | vinblastine, P-gp | 0.951 | 16917872 | Association listed in PharmGKB as “ambiguous”. |
| 18 | amprenavir, CYP3A4 | 0.950 | 9649346 | |
| 19 | perazine, CYP1A2 | 0.945 | 11026737 | |
| 20 | lopinavir, ABCB1 | 0.939 | 21743379 | |
*** indicates that an association has been demonstrated experimentally between changes in the expression/activity of the gene/protein and the efficacy of the drug
** indicates that such an association is likely, but has not yet been studied
* indicates that the association has been studied experimentally, and the experiment refuted the association. Here we include only associations between pharmaceutical compounds and single genes; predicted associations involving endogenous compounds and/or groups of genes are included in the supplement, however.
Top 20 predictions of new drug-target relationships for DrugBank.
| Rank | Candidate drug-gene pair | Relative certainty | Literature reference (PMID) | Comment |
|---|---|---|---|---|
| 1 | ketanserin, 5-HT2A | 1.000 | 16615363 | Ketanserin not in DrugBank. |
| 2 | losartan, A-II | 0.998 | 24807206 | “A-II” refers to the angiotensin type II receptor. In DrugBank this is listed as “Type-1 angiotensin II receptor”. |
| 3 | cangrelor, P2Y12 | 0.993 | 20048234 | Cangrelor not in DrugBank. |
| 4 | phencyclidine, nAChR | 0.992 | 9862757 | Phencyclidine is a noncompetitive inhibitor of nAChR. |
| 5 | anakinra, IL-1 | 0.991 | | |
| 6 | bosentan, endothelin-1 | 0.987 | | |
| 7 | imatinib, EGFR | 0.985 | 15887238 | Imatinib's effect on EGFR is ambiguous. It is not likely to be a direct target. |
| 8 | propanolol, Beta2 | 0.984 | | |
| 9 | carvedilol, Alpha1 | 0.984 | | |
| 10 | MK-571, leukotriene | 0.983 | | MK-571 is unknown to DrugBank. |
| 11 | zafirlukast, leukotriene | 0.981 | | |
| 12 | degarelix, GnRH | 0.980 | | GnRH receptor listed in DrugBank as “Gonadotropin-releasing hormone receptor”. Complicated because degarelix often referred to as “GnRH antagonist” but the target is actually the GnRH |
| 13 | nutlin-3, Mdm2 | 0.980 | 18646312 | Nutlin-3 disrupts the p53-Mdm2 complex. Nutlin-3 is unknown to DrugBank. |
| 14 | genistein, EGFR | 0.979 | 21603581 | Interestingly, authors found that genistein promotes cancer progression and increases EGFR signaling. |
| 15 | montelukast, leukotriene | 0.977 | | |
| 16 | aprepitant, NK-1 | 0.977 | | NK-1 listed in DrugBank as “Substance-P receptor”. |
| 17 | staurosporine, calmodulin | 0.975 | 1846174 | Staurosporine inhibits calmodulin-dependent protein kinase, not calmodulin. |
| 18 | nutlin-3, Hdm2 | 0.975 | 19696166 | Nutlin-3 is unknown to DrugBank. Hdm2 refers to the human version of the Mdm2 protein (13, above). |
| 19 | tropisetron, 5-HT4 | 0.974 | 11243577 | Tropisetron is unknown to DrugBank. |
| 20 | basiliximab, CD25 | 0.972 | 12591363 | CD25 is listed in DrugBank as “Interleukin-2 receptor subunit alpha”. |
*** indicates that the drug has been shown experimentally to have modified the activity of the gene/protein
** means that the interaction is known to DrugBank but is listed under an alternate drug or gene name
* means the interaction has been studied and is unlikely; P refers to a particular type of parser error in which the ligand of a receptor is mistaken for that receptor; L refers to a lexicon error (see Discussion).