E C Wood, Amy K Glen, Lindsey G Kvarfordt, Finn Womack, Liliana Acevedo, Timothy S Yoon, Chunyu Ma, Veronica Flores, Meghamala Sinha, Yodsawalai Chodpathumwan, Arash Termehchy, Jared C Roach, Luis Mendoza, Andrew S Hoffman, Eric W Deutsch, David Koslicki, Stephen A Ramsey.
Abstract
BACKGROUND: Biomedical translational science is increasingly using computational reasoning on repositories of structured knowledge (such as UMLS, SemMedDB, ChEMBL, Reactome, DrugBank, and SMPDB) in order to facilitate discovery of new therapeutic targets and modalities. The NCATS Biomedical Data Translator project is working to federate autonomous reasoning agents and knowledge providers within a distributed system for answering translational questions. Within that project and the broader field, there is a need for a framework that can efficiently and reproducibly build an integrated, standards-compliant, and comprehensive biomedical knowledge graph that can be downloaded in standard serialized form or queried via a public application programming interface (API).
Keywords: Biomedical knowledge integration; Knowledge graph; Semantic normalization
Year: 2022 PMID: 36175836 PMCID: PMC9520835 DOI: 10.1186/s12859-022-04932-3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.307
Fig. 1 Overall workflow of RTX-KG2. Blue triangle: individual external source; light blue cloud: external API endpoint; yellow parallelogram: tab-separated value (TSV) file-set; green hexagon: JavaScript Object Notation (JSON) file; orange cloud: API endpoint output; grey rectangle: SQLite [66] database; brown circle: abstract object-model representation of KG2c; turquoise computer: user/client computer; orange server: Translator knowledge graph exchange (KGE) server
RTX-KG2 integrates 70 knowledge sources into a single graph. Each row represents a server site from which sources were downloaded.
| Name | # | Description | Format | Method |
|---|---|---|---|---|
| Biolink | 1 | Biolink model (semantic layer) | TTL | RBM |
| ChEMBL | 1 | EMBL chemogenomic database | SQL | D2J |
| DGIdb | 1 | Drug gene interaction database | TSV | D2J |
| DisGeNET | 1 | Disease-gene associations | TSV | D2J |
| DrugBank | 1 | Pharmaceutical knowledge base | XML | D2J |
| DrugCentral | 1 | Online drug compendium | SQL | D2J |
| Ensembl Gene | 1 | Ensembl human gene annotations | JSON | D2J |
| EFO | 1 | Experimental Factor ontology | OWL | RBM |
| GO | 1 | Gene ontology annotations | TSV | D2J |
| HMDB | 1 | Human metabolite database | XML | D2J |
| IntAct | 1 | IntAct molecular interaction database | TSV | D2J |
| Jensen Lab Diseases | 1 | Gene to diseases dataset | TSV | D2J |
| KEGG | 1 | Kyoto encyclopedia of genes and genomes | API | D2J |
| miRBase | 1 | MicroRNAs dataset | DAT | D2J |
| NCBI Gene | 1 | NCBI human gene annotations | TSV | D2J |
| OBO Foundry | 21 | OBO foundry ontologies (Additional file 1: Table S1) | OWL | RBM |
| Orphanet | 1 | Orphanet rare disease ontology | OWL | RBM |
| PathBank | 1 | Wishart lab pathway databases | XML | D2J |
| Reactome | 1 | Pathway database | SQL | D2J |
| SemMedDB | 1 | Semantic MEDLINE database | SQL | D2J |
| SMPDB | 1 | Small molecule pathway database | CSV | D2J |
| UMLS | 26 | Unified medical language system (Table | TTL | RBM |
| UniChem | 1 | EBI small molecule cross-refs | TSV | D2J |
| UniProtKB | 1 | UniProt knowledge base | DAT | D2J |
| Total | 70 | | | |
Columns as follows: Name, the short name(s) of the knowledge sources obtained or the distribution name in the cases of UMLS and OBO Foundry; #, the number of individual sources or ontologies obtained from that server; Format, the file format used for ingestion (see below); Method, the ingestion method used for the source, either D2J for direct-to-JSON or RBM for the RDF-based method. File format codes: CSV, comma-separated value; DAT, SWISS-PROT-like DAT format; JSON, JavaScript object notation; OWL, OWL in RDF/XML [67] syntax; RRF, UMLS Rich Release Format [68]; SQL, structured query language (SQL) dump; TSV, tab-separated value; XML, extensible markup language. Other abbreviations: NCBI, National Center for Biotechnology Information; EMBL, European Molecular Biology Laboratory
Fig. 2 Node concept types in RTX-KG2.7.3 are based on the Biolink model version 2.1.0 [49, 50]
Fig. 3 Edge predicate types in RTX-KG2.7.3 are based on the Biolink model version 2.1.0
Fig. 4 Number of nodes in RTX-KG2.7.3pre, by category
Fig. 5 Number of edges in RTX-KG2.7.3pre, by predicate
Fig. 6 Node degree (in + out) distribution of RTX-KG2.7.3c
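
The degree distribution in Fig. 6 can be illustrated with a minimal sketch, assuming the graph is available as a list of directed (subject, object) pairs and that a node's degree is the sum of its incoming and outgoing edges. The function names and the node identifiers below are hypothetical and are not taken from the RTX-KG2 code base.

```python
from collections import Counter


def total_degree(edges):
    """Return a Counter mapping node id -> in-degree + out-degree.

    `edges` is an iterable of (subject, object) pairs (an assumed input shape).
    A self-loop contributes 2 to its node's total degree.
    """
    degree = Counter()
    for subject, obj in edges:
        degree[subject] += 1  # outgoing edge from the subject
        degree[obj] += 1      # incoming edge to the object
    return degree


def degree_distribution(edges):
    """Return a Counter mapping degree value -> number of nodes with that degree."""
    return Counter(total_degree(edges).values())


# Toy example with hypothetical node identifiers.
toy_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]
toy_degrees = total_degree(toy_edges)
toy_distribution = degree_distribution(toy_edges)
```

In the toy graph, nodes A and C each have total degree 3 and node B has total degree 2, so the distribution has two nodes at degree 3 and one at degree 2.
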
Fig. 7 Node neighbor counts by category for the top 20 most common categories in RTX-KG2.7.3c. Each cell captures the number of distinct pairs of neighbors with the specified subject and object categories
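
One way to read the cell counts in Fig. 7 is sketched below. It assumes that a "pair of neighbors" is a distinct, ordered (subject, object) node pair joined by at least one edge, so that parallel edges between the same two nodes are counted once. That reading, the function name, and the category labels are assumptions made for illustration only.

```python
from collections import Counter


def neighbor_pairs_by_category(edges, category):
    """Count distinct (subject, object) node pairs per (subject category, object category).

    `edges` is an iterable of (subject, object) pairs; `category` maps each
    node id to a single category label. Both shapes are assumed here.
    """
    distinct_pairs = set(edges)  # collapse parallel edges between the same two nodes
    return Counter((category[s], category[o]) for s, o in distinct_pairs)


# Toy example with hypothetical identifiers and category labels.
toy_category = {"D1": "Drug", "D2": "Drug", "G1": "Gene"}
toy_pairs = [("D1", "G1"), ("D1", "G1"), ("D2", "G1"), ("G1", "D1")]
toy_counts = neighbor_pairs_by_category(toy_pairs, toy_category)
```

Here the duplicated D1 to G1 edge is counted once, giving two Drug to Gene pairs and one Gene to Drug pair.
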
Node and edge counts for various knowledge graphs
| Knowledge graph | Nodes | Edges |
|---|---|---|
| HETIONET, v1 | 47,031 | 2.3 million |
| SPOKE ver. 20190707 | 2.15 million | 6.16 million |
| SRI Reference KG, ver. 2.0 | 20.2 million | 41.6 million |
| ROBOKOP | 6 million | 140 million |
| RTX-KG2.7.3pre | 10.2 million | 54.0 million |
| RTX-KG2.7.3c | 6.4 million | 39.3 million |
Numbers of unique node categories, edge predicates, and meta-triples for various knowledge graphs
| Knowledge graph | Categories | Predicates | Meta-triples |
|---|---|---|---|
| SPOKE, TRAPI v1.2.0 API | 14 | 24 | 44 |
| SRI Reference KG, ver. 2.0 | 62 | 59 | 2047 |
| ROBOKOP, TRAPI v1.2.0 API | 20 | 185 | 2234 |
| RTX-KG2.7.3pre | 56 | 77 | 10,269 |
| RTX-KG2.7.3c | 56 | 77 | 41,225 |
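
The meta-triple counts in the table above can be illustrated with a short sketch. It assumes a meta-triple is a unique combination of subject category, predicate, and object category, and that each edge is a dictionary carrying the `subject`, `predicate`, and `object` properties. The function name, node identifiers, and example predicates are hypothetical.

```python
def meta_triples(edges, category):
    """Return the set of (subject category, predicate, object category) combinations.

    `edges` is an iterable of dicts with "subject", "predicate" and "object"
    keys; `category` maps node id -> category label (assumed input shapes).
    """
    return {
        (category[edge["subject"]], edge["predicate"], category[edge["object"]])
        for edge in edges
    }


# Toy example with hypothetical identifiers.
example_category = {"D1": "Drug", "D2": "Drug", "G1": "Gene", "X1": "Disease"}
example_edges = [
    {"subject": "D1", "predicate": "interacts_with", "object": "G1"},
    {"subject": "D2", "predicate": "interacts_with", "object": "G1"},
    {"subject": "D1", "predicate": "treats", "object": "X1"},
]
example_meta_triples = meta_triples(example_edges, example_category)
```

The two drug-gene edges collapse into a single meta-triple, so the three edges yield two meta-triples. Under this definition, the meta-triple count reflects the diversity of category and predicate combinations rather than the number of edges.
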
Fig. 8 The proportion of results ARAX obtains for various one-hop queries when it is not allowed to use RTX-KG2 as one of its knowledge providers vs. when it is allowed to use RTX-KG2. A result of 100% means that RTX-KG2 provided no additional answers over ARAX’s other 12 Translator knowledge providers for that query; 0% means that all of ARAX’s results for that query came from RTX-KG2
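
The percentage plotted in Fig. 8 can be read as a simple ratio of result counts. The sketch below assumes that reading; the function name and the example counts are hypothetical and do not come from the paper.

```python
def percent_without_kg2(n_without_kg2, n_with_kg2):
    """Return results obtained without RTX-KG2 as a percentage of results with it.

    Returns None when the query has no results even with RTX-KG2, since the
    ratio is undefined in that case.
    """
    if n_with_kg2 == 0:
        return None
    return 100.0 * n_without_kg2 / n_with_kg2


# Hypothetical result counts for three queries.
no_gain = percent_without_kg2(50, 50)        # RTX-KG2 adds nothing
all_from_kg2 = percent_without_kg2(0, 40)    # every result comes from RTX-KG2
partial_gain = percent_without_kg2(30, 120)  # a quarter of results survive without RTX-KG2
```

These three cases evaluate to 100.0, 0.0, and 25.0 respectively, matching the two endpoints described in the caption.
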
Fig. 9 The BioThings Explorer query graph builder, which can be used to query RTX-KG2 among other Translator reasoning agents and knowledge providers
Fig. 10 Flowchart of tasks for building RTX-KG2pre (the precursor stage of RTX-KG2) from 21 upstream knowledge-base distributions
Node properties in RTX-KG2pre and RTX-KG2c
| Property | KG2pre | KG2c |
|---|---|---|
| all_categories | ||
| all_names | ||
| category | ||
| category_label | ||
| creation_date | ||
| deprecated | ||
| description | ||
| equivalent_curies | ||
| full_name | ||
| has_biological_sequence | ||
| id | ||
| iri | ||
| knowledge_source | ||
| name | ||
| provided_by | ||
| publications | ||
| replaced_by | ||
| synonym | ||
| update_date |
Edge properties in RTX-KG2pre and RTX-KG2c
| Property | KG2pre | KG2c |
|---|---|---|
| id | ||
| kg2_ids | ||
| knowledge_source | ||
| negated | ||
| object | ||
| predicate | ||
| predicate_label | ||
| provided_by | ||
| publications | ||
| publications_info | ||
| relation | ||
| relation_label | ||
| subject | ||
| update_date |
Upstream source files that must be staged in S3 in order to build RTX-KG2
| Source | Format | Notes |
|---|---|---|
| DrugBank | XML download | Requires browser to download |
| RepoDB | TSV download | Requires browser to download |
| SemMedDB | MySQL download | Requires browser to download |
| SMPDB PubMed IDs | CSV download | Obtained courtesy of Wishart Lab |
| UMLS metathesaurus | ZIP download | Requires browser to download |
UMLS sources that are integrated into RTX-KG2.
| Source | Abbreviation |
|---|---|
| UMLS semantic network | |
| Anatomical therapeutic chemical classification system | ATC |
| DrugBank database | DRUGBANK |
| Foundational model of anatomy | FMA |
| Gene ontology | GO |
| Healthcare common procedure coding system | HCPCS |
| HUGO gene nomenclature committee | HGNC |
| Health level seven version 3.0 | HL7V3.0 |
| Human phenotype ontology | HPO |
| ICD-10 procedure coding system | ICD10PCS |
| ICD-9, clinical modification | ICD9CM |
| Logical observation identifiers names and codes | LNC |
| Medication reference terminology | MED-RT |
| MEDLINE plus | MEDLINEPLUS |
| Medical subject headings (MeSH) | MSH |
| Metathesaurus | MTH |
| NCBI taxon | NCBI |
| National cancer institute thesaurus | NCI |
| National drug data file | NDDF |
| National drug file-reference terminology | NDFRT |
| Online Mendelian inheritance in man | OMIM |
| Physician data query | PDQ |
| Psychological index terms | PSY |
| RxNorm (normalized drug names) | RXNORM |
| National drug file | VANDF |
See the "Discussion" section regarding UMLS sources that could not be included due to licensing