Guillermo Palma, Maria-Esther Vidal, Eric Haag, Louiqa Raschid, Andreas Thor.
Abstract
Linked Open Data initiatives have made available a diversity of scientific collections where scientists have annotated entities in the datasets with controlled vocabulary terms from ontologies. Annotations encode scientific knowledge, which is captured in annotation datasets. Determining relatedness between annotated entities becomes a building block for pattern mining, e.g. identifying drug-drug relationships may depend on the similarity of the targets that interact with each drug. A diversity of similarity measures has been proposed in the literature to compute relatedness between a pair of entities. Each measure exploits some knowledge including the name, function, relationships with other entities, taxonomic neighborhood and semantic knowledge. We propose a novel general-purpose annotation similarity measure called 'AnnSim' that measures the relatedness between two entities based on the similarity of their annotations. We model AnnSim as a 1-1 maximum weight bipartite match and exploit properties of existing solvers to provide an efficient solution. We empirically study the performance of AnnSim on real-world datasets of drugs and disease associations from clinical trials and relationships between drugs and (genomic) targets. Using baselines that include a variety of measures, we identify where AnnSim can provide a deeper understanding of the semantics underlying the relatedness of a pair of entities or where it could lead to predicting new links or identifying potential novel patterns. Although AnnSim does not exploit knowledge or properties of a particular domain, its performance compares well with a variety of state-of-the-art domain-specific measures. Database URL: http://www.yeastgenome.org/
Year: 2015 PMID: 25725057 PMCID: PMC4343076 DOI: 10.1093/database/bau123
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Annotation graph of Clinical Trials from LinkedCT (blue ovals). Interventions are green rectangles; conditions are pink rectangles and CV terms from the NCIt are red ovals.
Figure 2. Annotation subgraph representing the annotations of Brentuximab vedotin and Catumaxomab. Interventions are green rectangles; conditions are pink rectangles and ontology terms in the NCIt are red circles.
Figure 3. Bipartite graphs for drugs Brentuximab vedotin and Catumaxomab. For legibility, only the value of the highest matching edges is shown in (a). (a) Weighted bipartite graph for Brentuximab vedotin and Catumaxomab. (b) 1-1 maximum weight bipartite matching for Brentuximab vedotin and Catumaxomab.
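The abstract models AnnSim as a 1-1 maximum weight bipartite matching between the annotation sets of two entities, as Figure 3 illustrates. A minimal sketch of that idea follows. It rests on several assumptions that are not taken from this text: the term-to-term similarities are given as a small matrix of made-up values, the matching is found by exhaustive search with memoization rather than by the existing solvers the authors mention, and the normalization `2 * weight / (|A1| + |A2|)` is only one plausible way to turn the matching weight into a score in [0, 1].

```python
from functools import lru_cache

def max_weight_matching(sim):
    """Total weight of a 1-1 maximum weight bipartite matching.

    sim[i][j] is the similarity between annotation i of the first entity
    and annotation j of the second. Exhaustive search with memoization,
    so it is only suitable for small annotation sets.
    """
    n_left = len(sim)
    n_right = len(sim[0]) if sim else 0

    @lru_cache(maxsize=None)
    def best(i, used):
        # i: next left annotation; used: bitmask of matched right annotations.
        if i == n_left:
            return 0.0
        # Option 1: leave left annotation i unmatched.
        result = best(i + 1, used)
        # Option 2: match it with any right annotation that is still free.
        for j in range(n_right):
            if not used & (1 << j):
                result = max(result, sim[i][j] + best(i + 1, used | (1 << j)))
        return result

    return best(0, 0)

def annsim(sim):
    """Assumed normalization: 2 * matching weight / (|A1| + |A2|)."""
    n_left = len(sim)
    n_right = len(sim[0]) if sim else 0
    if n_left + n_right == 0:
        return 0.0
    return 2.0 * max_weight_matching(sim) / (n_left + n_right)

# Hypothetical similarities: an entity with 2 annotations vs one with 3.
example = [
    [0.9, 0.2, 0.1],
    [0.8, 0.7, 0.3],
]
score = annsim(example)
```

For realistic annotation counts (Bevacizumab has 136 annotations in the drug table), a polynomial-time assignment solver would be needed instead of this exhaustive search.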
Description of the datasets
| Dataset | Description |
|---|---|
| 1 | Thirty pairs of diseases from the Mayo Clinic benchmark |
| 2 | Twelve anticancer drugs in the intersection of monoclonal antibodies and antineoplastic agents |
| 3 | Collection of pairs of proteins from UniProt |
| 4 | Collection of drug and target interactions from DrugBank |
| 5 | Collection of drug and target interactions collected by Yamanishi |
a http://www.uniprot.org/.
b http://www.drugbank.ca/.
Description of the ontologies used in the evaluation datasets
| Ontology | NCIt | SNOMED CT | MeSH | GO |
|---|---|---|---|---|
| Version | 12.05d | June 2012 | June 2012 | August 2008 |
| Number of nodes | 93 788 | 395 346 | 26 580 | 26 539 |
| Number of arcs | 104 439 | 539 245 | 36 212 | 43 213 |
| Used in Dataset | 1 and 2 | 1 | 1 | 3 |
Similarity measures for pairs of proteins in dataset 3
| Measure | Description |
|---|---|
| simUI (UI) | Jaccard index on the GO annotations of the proteins. |
| simGIC (GI) | Jaccard index where GO annotations of the compared proteins are weighted by their IC. |
| Resnik’s Average (RA) | Resnik’s measure where similarity of two terms is the average of IC of pairs of common ancestors. |
| Resnik’s Maximum (RM) | Resnik’s measure where similarity corresponds to the maximum value of IC of pairs of common ancestors. |
| Resnik’s Best-Match Average (RB) | Resnik’s measure where similarity corresponds to the average of IC of pairs of disjunctive common ancestors (DCA). |
| Lin’s Average (LA) | Lin’s measure that relates IC of the average of IC of pairs of common ancestors to IC of compared terms. |
| Lin’s Maximum (LM) | Lin’s measure that relates IC of the maximum value of IC of pairs of common ancestors to IC of compared terms. |
| Lin’s Best-Match Average (LB) | Lin’s measure that relates the IC of the average of the IC of pairs of DCA to IC of compared terms. |
| Jiang and Conrath’s Average (JA) | Jiang and Conrath’s measure where IC of average of IC of pairs of common ancestors is related to IC of compared terms. |
| Jiang and Conrath’s Maximum (JM) | Jiang and Conrath’s measure where IC of the maximum IC of pairs of common ancestors is related to IC of compared terms. |
| Jiang and Conrath’s Best-Match Average (JB) | Jiang and Conrath’s measure where the IC of the average IC of pairs of DCA is related to IC of compared terms. |
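The two set-based measures in this table can be illustrated with a short sketch. It assumes each protein is represented by a plain set of term identifiers and that information content (IC) values are supplied in a dictionary. The identifiers and IC values below are hypothetical, and any expansion of annotations with their ontology ancestors is left out.

```python
def sim_ui(terms_a, terms_b):
    """Jaccard index of two annotation sets."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def sim_gic(terms_a, terms_b, ic):
    """Jaccard index with each term weighted by its information content."""
    a, b = set(terms_a), set(terms_b)
    union_weight = sum(ic[t] for t in a | b)
    if union_weight == 0:
        return 0.0
    return sum(ic[t] for t in a & b) / union_weight

# Hypothetical term identifiers and IC values.
ic = {"T1": 1.0, "T2": 2.0, "T3": 3.0, "T4": 4.0}
protein_a = {"T1", "T2", "T3"}
protein_b = {"T2", "T3", "T4"}

ui = sim_ui(protein_a, protein_b)        # 2 shared terms out of 4 in the union
gic = sim_gic(protein_a, protein_b, ic)  # shared IC 5.0 out of 10.0 in the union
```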
Statistics of dataset 4 obtained from Perlman et al. (20)
| Number of drugs | Number of targets | Number of drug–target interactions |
|---|---|---|
| 315 | 250 | 1306 |
Similarity measures for drugs and targets in dataset 4 (20)
| Measure | Description |
|---|---|
| Drug–drug similarity measures | |
| Chemical based | Jaccard similarity of the SMILES fingerprints of the drugs. |
| Ligand based | Jaccard similarity between protein receptor families extracted via matched ligands with drugs’ SMILES on the SEA tool. |
| Expression based | Spearman’s correlation of gene expression responses to drugs using connectivity map. |
| Side-effect based | Jaccard similarity between drugs’ side-effects from SIDER. |
| Annotation based | Semantic similarity of drugs based on the WHO ATC classification system. |
| Target–target similarity measures | |
| Sequence based | Smith and Waterman scores. |
| Protein based | Shortest paths between human protein–protein interactions of the drugs. |
| GO based | Semantic similarity based on GO annotations computed using csbl.go package of R. |
a http://blast.ncbi.nlm.nih.gov/.
b http://csbi.ltdk.helsinki.fi/csbl.go/.
Statistics of dataset 5 downloaded from http://web.kuicr.kyoto-u.ac.jp/supp/yoshi/drugtarget/ (21)
| Statistics | Nuclear receptor | GPCR | Ion channel | Enzyme |
|---|---|---|---|---|
| Number of drugs (D) | 54 | 223 | 210 | 445 |
| Number of targets (T) | 26 | 95 | 204 | 664 |
| Number of D-T interactions | 90 | 635 | 1476 | 2926 |
Identifiers of the 30 pairs of diseases from the Mayo Clinic benchmark
| ID | Medical terms |
|---|---|
| 1 | Renal insufficiency – kidney failure |
| 2 | Heart – myocardium |
| 3 | Stroke – infarction |
| 4 | Abortion – miscarriage |
| 5 | Delusions – schizophrenia |
| 6 | Congestive heart failure – pulmonary edema |
| 7 | Metastasis – adenocarcinoma |
| 8 | Calcification – stenosis |
| 9 | Diarrhea – stomach cramps |
| 10 | Mitral stenosis – atrial fibrillation |
| 11 | Chronic obstructive pulmonary disease – lung infiltrates |
| 12 | Rheumatoid arthritis – lupus |
| 13 | Brain tumor – intracranial hemorrhage |
| 14 | Carpal tunnel syndrome – osteoarthritis |
| 15 | Diabetes mellitus – hypertension |
| 16 | Acne – syringe |
| 17 | Antibiotic – allergy |
| 18 | Cortisone – total knee replacement |
| 19 | Pulmonary embolism – myocardial infarction |
| 20 | Pulmonary fibrosis – lung cancer |
| 21 | Cholangiocarcinoma – colonoscopy |
| 22 | Lymphoid hyperplasia – laryngeal cancer |
| 23 | Multiple sclerosis – psychosis |
| 24 | Appendicitis – osteoporosis |
| 25 | Rectal polyp – aorta |
| 26 | Xerostomia – liver cirrhosis, alcoholic |
| 27 | Peptic ulcer – myopia |
| 28 | Depression – cellulitis |
| 29 | Varicose vein – entire knee meniscus |
| 30 | Hyperlipidemia – metastasis |
Similarity dataset 1: (1 – dtax) and (1 – dps) for SNOMED, MeSH and NCIt
| ID | Phy | Cod | SNOMED (1 – dtax) | SNOMED (1 – dps) | MeSH (1 – dtax) | MeSH (1 – dps) | NCIt (1 – dtax) | NCIt (1 – dps) |
|---|---|---|---|---|---|---|---|---|
| 1 | ||||||||
| 2 | 3.00 | 0.64 | 0.67 | 0.20 | 0.11 | |||
| 3 | 2.80 | 0.31 | 0.31 | 0.67 | ||||
| 4 | 3.00 | 0.00 | 0.00 | |||||
| 5 | 3.00 | 2.20 | 0.00 | 0.00 | 0.00 | 0.00 | 0.80 | 0.67 |
| 6 | 3.00 | 0.50 | 0.46 | 0.00 | 0.00 | 0.59 | ||
| 7 | 0.14 | 0.00 | 0.00 | |||||
| 8 | 2.70 | 0.38 | 0.00 | 0.00 | 0.40 | 0.25 | ||
| 9 | 2.30 | 0.29 | 0.75 | 0.63 | 0.42 | 0.30 | ||
| 10 | 1.30 | 0.46 | 0.50 | 0.33 | 0.53 | 0.36 | ||
| 11 | 1.90 | 0.70 | — | — | 0.13 | 0.07 | ||
| 12 | 0.33 | 0.11 | 0.86 | 0.75 | ||||
| 13 | 2.00 | 0.63 | 0.57 | 0.63 | 0.50 | 0.09 | ||
| 14 | 2.00 | 0.33 | 0.33 | 0.33 | 0.20 | |||
| 15 | 0.64 | 0.17 | 0.09 | |||||
| 16 | 2.00 | |||||||
| 17 | 1.70 | |||||||
| 18 | 1.70 | |||||||
| 19 | 1.20 | 0.36 | 0.29 | 0.29 | 0.63 | |||
| 20 | 1.70 | 1.40 | 0.75 | 0.63 | 0.67 | 0.50 | 0.60 | 0.50 |
| 21 | 1.30 | |||||||
| 22 | 1.30 | 0.43 | 0.33 | 0.36 | 0.22 | |||
| 23 | 0.44 | 0.29 | 0.33 | 0.20 | ||||
| 24 | 0.31 | 0.31 | 0.50 | 0.36 | ||||
| 25 | — | — | ||||||
| 26 | 0.14 | 0.08 | ||||||
| 27 | 0.23 | 0.29 | 0.15 | 0.08 | ||||
| 28 | 0.31 | 0.18 | ||||||
| 29 | 0.13 | 0.07 | — | — | ||||
| 30 | 0.33 | 0.20 | ||||||
Empty cells (—) represent terms that do not appear in the ontology. Values highlighted in bold show high correlation between the relevance given by the physician, coder and the measures. IDs are presented in Table 7.
nDCG of (1 – dtax) and (1 – dps)
| Measure | SNOMED (Physician) | SNOMED (Coder) | MeSH (Physician) | MeSH (Coder) | NCIt (Physician) | NCIt (Coder) |
|---|---|---|---|---|---|---|
| 1 − dtax | 0.837 | 0.961 | 0.977 | 0.957 | 0.959 | 0.959 |
| 1 − dps | 0.966 | 0.963 | 0.976 | 0.987 | 0.959 | 0.959 |
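The nDCG values above compare the ranking induced by a similarity measure with the relevance assigned by the physician or the coder. The sketch below shows one common formulation, assuming a logarithmic discount of `log2(position + 1)` and assuming the expert scores are used directly as gains. The text does not say which variant was used, and the scores in the example are hypothetical.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of relevances listed in ranked order."""
    return sum(rel / math.log2(pos + 1)
               for pos, rel in enumerate(relevances, start=1))

def ndcg(measure_scores, expert_relevance):
    """nDCG of the ranking induced by measure_scores against expert relevance."""
    # Rank items by the similarity measure, highest first (ties keep input order).
    order = sorted(range(len(measure_scores)), key=lambda i: -measure_scores[i])
    ranked = [expert_relevance[i] for i in order]
    ideal_dcg = dcg(sorted(expert_relevance, reverse=True))
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical expert relevance and measure scores for three pairs.
expert = [3.0, 2.0, 1.0]
perfect = ndcg([0.9, 0.7, 0.5], expert)    # same order as the expert
imperfect = ndcg([0.9, 0.5, 0.7], expert)  # last two pairs swapped
```

A ranking that agrees with the expert ordering scores 1, and any disagreement lowers the value, which is why values close to 1 in the table indicate close agreement with the physician or coder.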
Pairwise comparison of alemtuzumab with the other 11 drugs
| Drug pair | AnnSim | 1 – dtax | 1 – dps | HeteSim |
|---|---|---|---|---|
| Alemtuzumab - Bevacizumab | 0.263 | 0.670 | 0.500 | 0.001 |
| Alemtuzumab - Brentuximab vedotin | 0.140 | 0.364 | 0.222 | 0.000 |
| Alemtuzumab - Catumaxomab | 0.199 | 0.364 | 0.222 | 0.000 |
| Alemtuzumab - Cetuximab | 0.359 | 0.727 | 0.571 | 0.000 |
| Alemtuzumab - Edrecolomab | 0.037 | 0.727 | 0.571 | 0.000 |
| Alemtuzumab - Gemtuzumab | 0.046 | 0.500 | 0.333 | 0.000 |
| Alemtuzumab - Ipilimumab | 0.482 | 0.727 | 0.571 | 0.005 |
| Alemtuzumab - Ofatumumab | 0.468 | 0.727 | 0.571 | 0.002 |
| Alemtuzumab - Panitumumab | 0.422 | 0.727 | 0.571 | 0.000 |
| Alemtuzumab - Rituximab | 0.409 | 0.727 | 0.571 | 0.002 |
| Alemtuzumab - Trastuzumab | 0.319 | 0.727 | 0.571 | 0.000 |
HeteSim assumes perfect matching between annotations and assigns low similarity values.
Identifiers of the 12 anticancer drugs in the intersection of monoclonal antibodies and antineoplastic agents
| ID | Drug | Annotation count |
|---|---|---|
| 1 | Alemtuzumab | 39 |
| 2 | Bevacizumab | 136 |
| 3 | Brentuximab vedotin | 3 |
| 4 | Catumaxomab | 7 |
| 5 | Cetuximab | 50 |
| 6 | Edrecolomab | 1 |
| 7 | Gemtuzumab | 1 |
| 8 | Ipilimumab | 22 |
| 9 | Ofatumumab | 18 |
| 10 | Panitumumab | 22 |
| 11 | Rituximab | 100 |
| 12 | Trastuzumab | 18 |
Average similarity and standard deviation (avg; std) when each drug is compared with the 11 other drugs (antineoplastic agents and monoclonal antibodies)
| ID | AnnSim | (1 – dtax) | (1 – dps) | HeteSim |
|---|---|---|---|---|
| 1 | (0.286; 0.161) | (0.635; 0.150) | (0.479; 0.146) | (0.001; 0.002) |
| 2 | (0.206; 0.173) | (0.636; 0.152) | (0.479; 0.146) | (0.002; 0.002) |
| 3 | (0.206; 0.125) | (0.433; 0.093) | (0.284; 0.091) | (0.002; 0.007) |
| 4 | (0.244; 0.106) | (0.416; 0.066) | (0.269; 0.061) | (0.002; 0.003) |
| 5 | (0.303; 0.189) | (0.691; 0.163) | (0.547; 0.171) | (0.003; 0.004) |
| 6 | (0.157; 0.211) | (0.691; 0.162) | (0.547; 0.171) | (0.004; 0.014) |
| 7 | (0.157; 0.219) | (0.539; 0.045) | (0.375; 0.046) | (0.000; 0.000) |
| 8 | (0.363; 0.208) | (0.691; 0.163) | (0.547; 0.171) | (0.004; 0.003) |
| 9 | (0.302; 0.159) | (0.692; 0.162) | (0.547; 0.171) | (0.003; 0.007) |
| 10 | (0.358; 0.212) | (0.692; 0.162) | (0.547; 0.171) | (0.007; 0.014) |
| 11 | (0.222; 0.169) | (0.691; 0.163) | (0.547; 0.171) | (0.001; 0.001) |
| 12 | (0.304; 0.175) | (0.692; 0.162) | (0.547; 0.171) | (0.002; 0.003) |
| Average | (0.259; 0.176) | (0.625; 0.137) | (0.476; 0.141) | (0.003; 0.005) |
IDs are presented in Table 11.
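As a worked check, the AnnSim entry for drug 1 (Alemtuzumab) can be recomputed from the eleven AnnSim values in the pairwise comparison table for alemtuzumab. The sketch uses the sample standard deviation (division by n − 1). That choice is an assumption, since the text does not state which form was used, although it reproduces the reported pair.

```python
import statistics

# AnnSim values for Alemtuzumab against the other 11 drugs,
# copied from the pairwise comparison table.
alemtuzumab_annsim = [0.263, 0.140, 0.199, 0.359, 0.037, 0.046,
                      0.482, 0.468, 0.422, 0.409, 0.319]

avg = statistics.mean(alemtuzumab_annsim)
std = statistics.stdev(alemtuzumab_annsim)  # sample standard deviation (n - 1)
print(f"({avg:.3f}; {std:.3f})")  # prints (0.286; 0.161)
```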
Spearman’s correlation for AnnSim and (1 − dtax) (SRank1) and the correlation for AnnSim and (1 − dps) (SRank2)
| ID | SRank1 | SRank2 |
|---|---|---|
| 1 | 0.625 | 0.625 |
| 2 | 0.505 | 0.543 |
| 3 | 0.752 | 0.752 |
| 4 | 0.348 | 0.339 |
| 5 | 0.523 | 0.507 |
| 6 | −0.318 | −0.318 |
| 7 | 0.511 | 0.466 |
| 8 | 0.502 | 0.502 |
| 9 | 0.382 | 0.411 |
| 10 | 0.514 | 0.525 |
| 11 | 0.311 | 0.311 |
| 12 | 0.350 | 0.364 |
IDs are presented in Table 11.
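Spearman's correlation compares two measures by the ranks they assign rather than by their raw values. The sketch below computes it as the Pearson correlation of rank-transformed values, with tied values given the mean of their ranks. The tie handling is an assumption, since the text does not describe it, and the input values are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical scores from two measures over four drug pairs (one tie in x).
rho = spearman([0.2, 0.5, 0.5, 0.9], [1.0, 3.0, 2.0, 4.0])
```

A negative value, as for drug 6 in the table, means the two measures tend to order the other drugs in opposite directions.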
Average similarity and standard deviation (avg; std) of AnnSim for 7 out of the 12 anticancer drugs in the intersection of monoclonal antibodies and antineoplastic agents
| ID | Drug | AnnSim values |
|---|---|---|
| 1 | Alemtuzumab | (0.757; 0.315) |
| 2 | Bevacizumab | (0.702; 0.285) |
| 5 | Cetuximab | (0.738; 0.143) |
| 7 | Gemtuzumab | (0.757; 0.316) |
| 10 | Panitumumab | (0.254; 0.130) |
| 11 | Rituximab | (0.757; 0.315) |
| 12 | Trastuzumab | (0.636; 0.156) |
Annotations correspond to NCIt terms of the diseases associated with these drugs at the DrugBank SPARQL endpoint.
Figure 4. Comparison of AnnSim with SeqSim and similarity measures from Table 3. Results are produced by CESSM for GO BP terms. (a) Average values for AnnSim, the measures in Table 3 and SeqSim. (b) Plot of AnnSim and SeqSim scores (Pearson’s correlation of 0.65). The similarity measures are simUI (UI), simGIC (GI), Resnik’s Average (RA), Resnik’s Maximum (RM), Resnik’s Best-Match Average (RB), Lin’s Average (LA), Lin’s Maximum (LM), Lin’s Best-Match Average (LB), Jiang and Conrath’s Average (JA), Jiang and Conrath’s Maximum (JM), Jiang and Conrath’s Best-Match Average (JB).
Pearson’s correlation coefficient between the three standards of evaluation and the 12 similarity measures on dataset 3
| Similarity measure | SeqSim | P value | EC | P value | Pfam | P value |
|---|---|---|---|---|---|---|
| GI | | <0.01 | 0.3981 | 0.4468 | 0.4547 | 0.1593 |
| UI | 0.7304 | <0.01 | 0.4023 | 0.1810 | 0.4505 | 0.0440 |
| RA | 0.4068 | <0.01 | 0.3022 | <0.01 | 0.3232 | <0.01 |
| RM | 0.3027 | <0.01 | 0.3076 | <0.01 | 0.2627 | <0.01 |
| RB | 0.7397 | <0.01 | | <0.01 | 0.4588 | <0.01 |
| LA | 0.3407 | <0.01 | 0.3041 | <0.01 | 0.2866 | <0.01 |
| LM | 0.2540 | <0.01 | 0.3134 | <0.01 | 0.2064 | <0.01 |
| LB | 0.6369 | <0.01 | 0.4352 | <0.01 | 0.3727 | <0.01 |
| JA | 0.2164 | <0.01 | 0.1931 | <0.01 | 0.1732 | <0.01 |
| JM | 0.2350 | <0.01 | 0.2541 | <0.01 | 0.1649 | <0.01 |
| JB | 0.5864 | <0.01 | 0.3707 | <0.01 | 0.3319 | <0.01 |
| AnnSim | 0.6510 | – | 0.3926 | – | | – |
The P values represent the probability of obtaining the correlation coefficient for AnnSim, EC and Pfam assuming the correlation coefficient of the other 11 similarity measures. The highest correlation in each standard of evaluation is highlighted in bold.
Average similarity of the 259 clusters of the clustering obtained using an EM algorithm for each drug–drug measure on 310 drugs
| AnnSimseq | AnnSimdist | AnnSimgo | ATC | Chem. | Ligand | CMap | SideEff. |
|---|---|---|---|---|---|---|---|
| 0.8939 | 0.8939 | 0.8939 | 0.9129 | 0.8737 | 0.8727 | 0.8304 | 0.8746 |
Figure 5. Distribution of the number of clusters of the clustering obtained by four drug–drug similarity measures.
Description of three clusters obtained using the AnnSim measure and the EM clustering algorithm of WEKA
| No. of elements in the cluster | DrugBank drug categories in the cluster | No. of drugs with this category |
|---|---|---|
| 10 | Immunosuppressive agents | 1 |
| | Neuroprotective agents | 1 |
| | Antipruritic agents | 1 |
| | Antiemetics | 1 |
| | Anti-asthmatic agents | 1 |
| | | 1 |
| | | 1 |
| | Anti-allergic agents | 1 |
| | Steroidal | 1 |
| | Adrenergic agents | 3 |
| | Antineoplastic agents | 1 |
| | “Antineoplastic agents | 1 |
| 6 | Sympathomimetic | 1 |
| | Anti-anxiety agents | 1 |
| | Vasodilator agents | 1 |
| | Sympathomimetics | 1 |
| | Anti-arrhythmia agents | |
| | Cardiotonic agents | 1 |
| | EENT drugs | 1 |
| | Sympatholytics | 3 |
| | Antihypertensive agents | 4 |
| 6 | Nucleic acid synthesis inhibitors | 3 |
| | “Antibiotics | 1 |
| | Anti-bacterial agents | 1 |
| | Enzyme inhibitors | 1 |
| | Photosensitizing agents | 1 |
| | Antibiotics | 1 |
| | Analgesics | 1 |
| | Quinolones | 2 |
| | Antitubercular agents | 1 |
| | Antineoplastic agents | 2 |
One cluster with 10 elements and two with six elements are shown. We highlight in bold similar category terms or terms with high frequency.
Figure 6. Distribution of the number of clusters of our gold standard clustering.
Jaccard similarity coefficient between each drug–drug measure clustering and the ground truth clustering
| AnnSimseq | AnnSimdist | AnnSimgo | ATC | Chem. | Ligand | CMap | SideEff. |
|---|---|---|---|---|---|---|---|
| 0.5657 | 0.5657 | 0.5657 | 0.7175 | 0.7512 | 0.7431 | 0.7045 | 0.7211 |
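The text does not say how the Jaccard coefficient between two clusterings was computed. One common definition counts pairs of items: the number of pairs placed together in both clusterings, divided by the number of pairs placed together in at least one. The sketch below implements that pair-counting definition as an assumption, with hypothetical cluster labels.

```python
from itertools import combinations

def clustering_jaccard(labels_a, labels_b):
    """Pair-counting Jaccard coefficient between two clusterings.

    labels_a[i] and labels_b[i] are the cluster labels of item i in each
    clustering; both lists must cover the same items in the same order.
    """
    both = only_a = only_b = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            both += 1
        elif same_a:
            only_a += 1
        elif same_b:
            only_b += 1
    total = both + only_a + only_b
    # If no pair is co-clustered in either clustering, treat them as identical.
    return both / total if total else 1.0

# Hypothetical labels for four drugs under a measure and under the ground truth.
measure_labels = [0, 0, 1, 1]
truth_labels = [0, 0, 0, 1]
jaccard = clustering_jaccard(measure_labels, truth_labels)
```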
Comparison of clusterings produced by k-means with 259 centers for AnnSim and Sim (drug–drug similarity measure computed by SIMCOMP)
| Enzyme | | GPCR | | Ion channel | | Nuclear receptor | |
|---|---|---|---|---|---|---|---|
| AnnSim | Sim | AnnSim | Sim | AnnSim | Sim | AnnSim | Sim |
| Davies–Bouldin index | | | | | | | |
| 1.27 | 1.97 | 1.12 | 1.63 | ||||
| Coupling measure | | | | | | | |
| 0.05 | 0.06 | 0.07 | 0.08 | 0.07 | 0.08 | 0.16 | 0.17 |
Davies–Bouldin index indicates how distant the points in a cluster are, i.e. low values suggest that drugs in a cluster are similar. The Coupling Measure indicates how similar centroids in a clustering are, i.e. low values suggest that the centroids are distant. More distant values are highlighted in bold.
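The Davies–Bouldin index can be sketched as follows. The sketch assumes the clustered items are points in Euclidean space and uses the standard definition: the average, over clusters, of the worst ratio between summed within-cluster scatter and centroid distance. How the index was computed over drug similarity values is not described in the text, and the coordinates in the example are hypothetical.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def davies_bouldin(clusters):
    """Davies-Bouldin index; requires at least two clusters with distinct centroids."""
    centers = [centroid(c) for c in clusters]
    # Scatter: mean distance from each point to the centroid of its cluster.
    scatter = [sum(math.dist(p, ctr) for p in c) / len(c)
               for c, ctr in zip(clusters, centers)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max((scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
                     for j in range(k) if j != i)
    return total / k

# Two hypothetical clusterings with the same centroids but different spread.
tight = davies_bouldin([[(0, 0), (0, 2)], [(10, 0), (10, 2)]])
loose = davies_bouldin([[(0, 0), (0, 8)], [(10, 0), (10, 8)]])
```

The compact clustering yields the lower index, in line with the note that low values suggest the drugs within a cluster are similar.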
Description of a cluster in the GPCR dataset obtained using the AnnSim measure
| Target | No. of interactions |
|---|---|
| Androgen receptor | 1 |
| Gamma-aminobutyric-acid receptor class | 19 |
| Heat shock protein HSP 90-alpha | 1 |
| Mineralocorticoid receptor | 1 |
| 16S rRNA | 1 |
| C-1-tetrahydrofolate synthase, cytoplasmic | 1 |
| Glucocorticoid receptor | 1 |
| Inosine-5′-monophosphate dehydrogenase 1 | 1 |
| 30S ribosomal protein S12 | 1 |
Cluster with nine elements, their targets and frequency of interactions.