David N Nicholson, Casey S Greene.
Abstract
Knowledge graphs can support many biomedical applications. These graphs represent biomedical concepts and relationships in the form of nodes and edges. In this review, we discuss how these graphs are constructed and applied with a particular focus on how machine learning approaches are changing these processes. Biomedical knowledge graphs have often been constructed by integrating databases that were populated by experts via manual curation, but we are now seeing a more robust use of automated systems. A number of techniques are used to represent knowledge graphs, but often machine learning methods are used to construct a low-dimensional representation that can support many different applications. This representation is designed to preserve a knowledge graph's local and/or global structure. Additional machine learning methods can be applied to this representation to make predictions within genomic, pharmaceutical, and clinical domains. We frame our discussion first around knowledge graph construction and then around unifying representational learning techniques and unifying applications. Advances in machine learning for biomedicine are creating new opportunities across many domains, and we note potential avenues for future work with knowledge graphs that appear particularly promising.
Keywords: Literature review; Machine learning; Natural language processing; Network embeddings; Text mining; Knowledge graphs
Year: 2020 PMID: 32637040 PMCID: PMC7327409 DOI: 10.1016/j.csbj.2020.05.017
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Fig. 1 The metagraph (i.e., schema) of the knowledge graph used in the Rephetio project [9]. The authors of this project refer to their resource as a heterogeneous network (i.e., hetnet), and this network meets our definition of a knowledge graph. This resource depicts pharmacological and biomedical information in the form of nodes and edges. The nodes (circles) represent entities, and the edges (lines) represent relationships shared between two entities. The majority of edges in this metagraph are depicted as unidirectional, but some relationships can be considered bidirectional.
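The distinction between a knowledge graph and its metagraph can be illustrated with a minimal sketch. The node names, node types, and edge labels below are hypothetical placeholders chosen for illustration and are not taken from the Rephetio resource; the helper functions `metagraph` and `neighbors` are likewise assumptions of this sketch.

```python
# Toy instance-level graph. Node names, types, and edge labels are illustrative
# assumptions, not entries taken from the Rephetio resource.
node_types = {
    "gene_A": "Gene",
    "disease_X": "Disease",
    "compound_Y": "Compound",
}
edges = [
    ("gene_A", "associates", "disease_X"),
    ("compound_Y", "treats", "disease_X"),
]


def metagraph(node_types, edges):
    """Collapse instance edges into (source type, edge label, target type) triples."""
    return {(node_types[src], label, node_types[dst]) for src, label, dst in edges}


def neighbors(edges, node):
    """Return nodes linked to `node`, treating every edge as bidirectional."""
    linked = set()
    for src, _label, dst in edges:
        if src == node:
            linked.add(dst)
        elif dst == node:
            linked.add(src)
    return linked


schema = metagraph(node_types, edges)
disease_neighbors = neighbors(edges, "disease_X")
```

Here `schema` holds one triple per kind of relationship, which is the role the metagraph plays in the figure, while `neighbors` ignores edge direction to reflect relationships that can be read both ways.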
A table of databases that used a form of manual curation to populate entries. The reported numbers of entities and relationships are as of the time of publication.
| Database [Reference] | Short Description | Number of Entries | Entity Types | Relationship Types | Method of Population |
|---|---|---|---|---|---|
| BioGrid | A database for major model organisms. It contains genetic and proteomic information. | 572,084 | Genes, Proteins | Protein-Protein interactions | Semi-automatic methods |
| Comparative Toxicogenomics Database | A database that contains manually curated chemical-gene-disease interactions and relationships. | 2,429,689 | Chemicals (Drugs), Genes, Diseases | Drug-Genes, Drug-Disease, Disease-Gene mappings | Manual curation and Automated systems |
| Comprehensive Antibiotic Resistance Database | Manually curated database that contains information about the molecular basis of antimicrobial resistance. | 174,443 | Drugs, Genes, Variants | Drug-Gene, Drug-Variant mappings | Manual curation |
| COSMIC | A database that contains high resolution human cancer genetic information. | 35,946,704 | Genes, Variants, Tumor Types | Gene-Variant Mappings | Manual Curation |
| Entrez-Gene | NCBI’s gene annotation database, which contains information on genes, their source organisms, phenotypes, etc. | 7,883,114 | Genes, Species and Phenotypes | Gene-Phenotypes and Genes-Species mappings | Semi-automated curation |
| OMIM | A database that contains phenotype and genotype information | 25,153 | Genes, Phenotypes | Gene-Phenotype mappings | Manual Curation |
| PharmGKB | A database that contains genetic, phenotypic, and clinical information related to pharmacogenomic studies. | 43,112 | Drugs, Genes, Phenotypes, Variants, Pathways | Gene-Phenotypes, Pathway-Drugs, Gene-Variants, Gene-Pathways | Manual Curation and Automated Methods |
| UniProt | A protein–protein interaction database that contains proteomic information. | 560,823 | Proteins, Protein sequences | Protein-Protein interactions | Manual and Automated Curation |
Fig. 2 A visualization of a constituency parse tree for the sentence “BRCA1 is associated with breast cancer” [73]. The root of this type of tree represents the sentence as a whole. Each word is grouped into subphrases according to its part-of-speech tag. For example, the word “associated” is a past participle verb (VBN) that belongs to the verb phrase (VP) subgroup.
Fig. 3 A visualization of a dependency parse tree for the sentence “BRCA1 is associated with breast cancer” [74]. For these types of trees, the root is the main verb of the sentence. Each arrow represents the dependency shared between two words. For example, the dependency between “BRCA1” and “associated” is nsubjpass, which stands for passive nominal subject. This means that “BRCA1” is the subject of the passive verb “associated”.
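A dependency tree like the one described in this caption can be stored as a list of labeled head-to-dependent edges. The sketch below is hand-written, not parser output: only the nsubjpass label comes from the caption, and the remaining labels (auxpass, prep, pobj, compound) are assumed Stanford-style labels used for illustration. The helpers `find_root` and `relation_between` are hypothetical names.

```python
# Hand-written (head, label, dependent) edges for
# "BRCA1 is associated with breast cancer".
# Only "nsubjpass" is stated in the caption; the other labels are assumptions.
dependencies = [
    ("associated", "nsubjpass", "BRCA1"),
    ("associated", "auxpass", "is"),
    ("associated", "prep", "with"),
    ("with", "pobj", "cancer"),
    ("cancer", "compound", "breast"),
]


def find_root(dependencies):
    """Return the single word that heads other words but depends on none."""
    heads = {head for head, _label, _dep in dependencies}
    dependents = {dep for _head, _label, dep in dependencies}
    roots = heads - dependents
    if len(roots) != 1:
        raise ValueError("expected exactly one root, found: %r" % sorted(roots))
    return next(iter(roots))


def relation_between(dependencies, head, dependent):
    """Return the dependency label on the edge head -> dependent, or None."""
    for h, label, d in dependencies:
        if h == head and d == dependent:
            return label
    return None


root = find_root(dependencies)
subject_label = relation_between(dependencies, "associated", "BRCA1")
```

With these edges, `root` is the main verb and `subject_label` is the passive nominal subject relation named in the caption.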
Table of approaches that mainly use a form of co-occurrence.
| Study | Relationship of Interest |
|---|---|
| CoCoScore | Protein-Protein Interactions, Disease-Gene and Tissue-Gene Associations |
| Rastegar-Mojarad et al. | Drug Disease Treatments |
| CoPub Discovery | Drug, Gene and Disease interactions |
| Westergaard et al. | Protein-Protein Interactions |
| DISEASES | Disease-Gene associations |
| STRING | Protein-Protein Interactions |
| Singhal et al. | Genotype-Phenotype Relationships |
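The core idea behind co-occurrence approaches can be sketched as counting how often two entities appear in the same sentence. This is a deliberately naive illustration and not the method of any study listed above, which may use more sophisticated scoring. The sentences, the entity list, the case-insensitive substring matching, and the function name `cooccurrence_counts` are all assumptions of this sketch.

```python
from collections import Counter
from itertools import combinations

# Toy sentences and entity dictionary, invented for illustration only.
sentences = [
    "BRCA1 is associated with breast cancer",
    "Mutations in BRCA1 were studied in breast cancer cohorts",
    "gene_B was measured in liver tissue",
]
entities = ["BRCA1", "gene_B", "breast cancer"]


def cooccurrence_counts(sentences, entities):
    """Count sentence-level co-mentions for every pair of known entities."""
    counts = Counter()
    for sentence in sentences:
        lowered = sentence.lower()
        # Naive case-insensitive substring match; real systems use entity taggers.
        found = sorted(e for e in entities if e.lower() in lowered)
        for pair in combinations(found, 2):
            counts[pair] += 1
    return counts


counts = cooccurrence_counts(sentences, entities)
```

Pairs with high counts become candidate relationships; a raw count says nothing about the type or direction of the relationship, which is one reason it is usually only a starting point.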
A set of publicly available datasets for supervised text mining.
| Dataset | Type of Sentences |
|---|---|
| AIMed | Protein-Protein Interactions |
| BioInfer | Protein-Protein Interactions |
| LLL | Protein-Protein Interactions |
| IEPA | Protein-Protein Interactions |
| HPRD5 | Protein-Protein Interactions |
| EU-ADR | Disease-Gene Associations |
| BeFree | Disease-Gene Associations |
| CoMAGC | Disease-Gene Associations |
| CRAFT | Disease-Gene Associations |
| Biocreative V CDR | Compound induces Disease |
| Biocreative IV ChemProt | Compound-Gene Bindings |
Fig. 4 Pipeline for representing knowledge graphs in a low-dimensional space. Starting with a knowledge graph, this space can be generated using one of the following options: Matrix Factorization (a), Translational Models (b) or Neural Network Models (c). The output of this pipeline is an embedding space that clusters similar node types together.
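As a rough illustration of option (b), translational models in the spirit of TransE treat a relationship as a vector that moves a head entity toward its tail entity, so a plausible edge has a small distance between head + relation and tail. The sketch below uses hand-picked two-dimensional vectors rather than learned embeddings, and every name in it (`gene_A`, `gene_B`, `disease_X`, `associates`, `translation_distance`) is hypothetical.

```python
import math

# Hand-picked 2-D vectors standing in for learned embeddings (assumption).
entity_vectors = {
    "gene_A": [0.0, 1.0],
    "gene_B": [3.0, -1.0],
    "disease_X": [1.0, 1.0],
}
relation_vectors = {"associates": [1.0, 0.0]}


def translation_distance(head, relation, tail):
    """Euclidean norm of (head + relation - tail); lower means more plausible."""
    h = entity_vectors[head]
    r = relation_vectors[relation]
    t = entity_vectors[tail]
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


close = translation_distance("gene_A", "associates", "disease_X")
far = translation_distance("gene_B", "associates", "disease_X")
```

In a trained model the vectors would be fit so that observed edges score lower than corrupted ones; here `close` is smaller than `far` only because the vectors were chosen that way.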
Fig. 5 Overview of various biomedical applications that make use of knowledge graphs. The categories are: (a) Multi-Omic Applications, (b) Pharmaceutical Applications and (c) Clinical Applications.