Wytze J Vlietstra, Rein Vos, Anneke M Sijbers, Erik M van Mulligen, Jan A Kors.
Abstract
BACKGROUND: Biomedical knowledge graphs have become important tools to computationally analyse the comprehensive body of biomedical knowledge. They represent knowledge as subject-predicate-object triples, in which the predicate indicates the relationship between subject and object. A triple can also contain provenance information, which consists of references to the sources of the triple (e.g. scientific publications or database entries). Knowledge graphs have been used to classify drug-disease pairs for drug efficacy screening, but existing computational methods have often ignored predicate and provenance information. Using this information, we aimed to develop a supervised machine learning classifier and determine the added value of predicate and provenance information for drug efficacy screening. To ensure the biological plausibility of our method, we performed our research on the protein level, where drugs are represented by their drug target proteins, and diseases by their disease proteins.
Keywords: Computational pharmacology; Drug efficacy screening; Drug repurposing; Knowledge graph; Machine learning; Predicate; Provenance; Systems pharmacology
Year: 2018 PMID: 30189889 PMCID: PMC6127943 DOI: 10.1186/s13326-018-0189-6
Source DB: PubMed Journal: J Biomed Semantics
Fig. 1. The three included relationship scenarios. The three scenarios of relationships between drug targets and disease proteins are shown along with examples which can be found in the knowledge graph. a Drug target (DT) and disease protein (DP) are the same protein. The protein may have a relationship with itself (dotted line). b DT and DP have a direct relationship. c DT and DP have an indirect relationship through an intermediate protein (IP). Indirect relationships consist of two steps (DTIP and IPDP).
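The three scenarios in Fig. 1 can be illustrated with a minimal sketch. The toy triples, the protein names, and the function names `build_neighbours` and `relationship_scenarios` are hypothetical, and treating relationships as undirected is an assumption made for simplicity, not something stated in the source.

```python
from collections import defaultdict

# Hypothetical toy triples (subject, predicate, object); not taken from the
# knowledge graph used in the paper.
TRIPLES = [
    ("P1", "interacts_with", "P2"),
    ("P2", "stimulates", "P3"),
    ("P4", "inhibits", "P4"),
]


def build_neighbours(triples):
    """Map each protein to the proteins it shares a triple with.

    Relationships are treated as undirected here (an assumption).
    """
    neighbours = defaultdict(set)
    for subj, _pred, obj in triples:
        neighbours[subj].add(obj)
        neighbours[obj].add(subj)
    return dict(neighbours)


def relationship_scenarios(dt, dp, neighbours):
    """Return which Fig. 1 scenarios ('a', 'b', 'c') hold for a DT-DP pair."""
    if dt == dp:
        # Scenario a: drug target and disease protein are the same protein.
        return {"a"}
    dt_nb = neighbours.get(dt, set())
    dp_nb = neighbours.get(dp, set())
    found = set()
    if dp in dt_nb:
        # Scenario b: a direct relationship exists.
        found.add("b")
    if (dt_nb & dp_nb) - {dt, dp}:
        # Scenario c: at least one intermediate protein links DT and DP.
        found.add("c")
    return found


NEIGHBOURS = build_neighbours(TRIPLES)
```

In this sketch a pair can match both scenario b and scenario c at once, and an empty set means no path of at most two steps was found.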
Characteristics of the two reference sets
| Characteristics | Guney reference set | EMC reference set |
|---|---|---|
| Source of drug-disease indications | MEDI-HPS + Metab2MeSH + manual curation | MEDI-HPS [ |
| Drug target sets | 238 | 314 |
| Unique drug targets | 384 | 539 |
| Source of drug targets | DrugBank | Santos et al. [ |
| Disease protein sets | 78 | 281 |
| Unique disease proteins | 2726 | 3205 |
| Minimum size of disease protein set | 20 | 1 |
| Median size of disease protein set | 52 | 5 |
| Maximum size of disease protein set | 606 | 273 |
| Source of disease proteins | OMIM + GWAS | DisGeNet, curated subset [ |
| Number of positive cases | 402 | 1250 |
| Number of negative cases | 18,162 | 86,984 |
Fig. 2. Schematic overview of the feature extraction and classification process. For the sake of readability, this overview figure only shows the process for predicates. The input set contains the combinations of drug targets (DT) and disease proteins (DP) that are to be classified. Step 1: Extract paths. The paths between drug targets and disease proteins are extracted from the knowledge graph. Paths can be direct or indirect. Indirect paths have one intermediate protein (IP) and are split into two steps: DTIP (drug target – intermediate protein) and IPDP (intermediate protein – disease protein). Step 2: Extract features. The feature set consists of all possible predicates and provenance, for each of the three scenarios (cf. Fig. 1). Based on the extracted paths for a combination, each feature is marked as present or absent. Step 3: Classify. Based on the extracted features, the combinations are classified by a random forest classifier.
Performance results for different feature sets
| Feature set | AUC Guney reference set | AUC EMC reference set |
|---|---|---|
| Overlap and co-occurrence features | 59.8% (0.9%)* | 64.9% (0.6%) |
| Overlap and predicate features | 77.6% (1.6%) | 73.1% (0.9%) |
| Overlap and provenance features | 75.1% (1.7%) | 71.3% (1.0%) |
| Overlap, predicate and provenance features (all relationships) | 78.1% (1.7%) | 74.3% (1.0%) |
| Predicate and provenance features (indirect relationships only) | 74.4% (1.9%) | 70.6% (1.0%) |
| Guney’s proximity metric | 65.6% (1.4%) | 64.6% (0.6%) |
*Values indicate mean and standard deviation of the AUCs of 100 experiments
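As a rough illustration of how such values could be produced, the sketch below computes an AUC from labels and classifier scores and then summarises repeated experiments as a mean and standard deviation. The function names and example numbers are hypothetical, and the use of the sample standard deviation is an assumption; the source does not state which variant was used.

```python
from statistics import mean, stdev


def auc(labels, scores):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive case scores higher than a randomly chosen negative case
    (ties count as one half)."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def summarise(aucs):
    """Mean and sample standard deviation of the AUCs of repeated experiments."""
    return mean(aucs), stdev(aucs)
```

For example, `summarise` would be called on the list of 100 AUC values, one per experiment, to obtain a table entry of the form "mean (standard deviation)".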
Fig. 3. The most important features for a cross-validation experiment. The top-20 most important features when trained on the complete feature set are presented. The importance measures, calculated with the standard feature importance calculation function of the random forest algorithm, have been normalized. The colours indicate whether it is a predicate, provenance, or overlap feature. While knowledge sources such as SemMedDB contain information about relationships between many types of entities, we only used the protein-protein interaction (PPI) subsets of these datasets.
Classification performance stratified by the number of proteins targeted by a drug
| Number of targets per drug | Guney reference set: number of combinations | Guney reference set: AUC | EMC reference set: number of combinations | EMC reference set: AUC |
|---|---|---|---|---|
| 1 | 133 | 71.8% (2.9%)* | 552 | 71.8% (1.4%) |
| 2 | 125 | 78.5% (2.4%) | 244 | 75.6% (1.5%) |
| > 2 | 144 | 82.4% (2.2%) | 454 | 76.6% (1.5%) |
| All | 402 | 78.1% (1.7%) | 1250 | 74.3% (1.0%) |
*Values indicate the mean and standard deviation of the AUCs for 100 experiments