Alisa Pavel, Laura A Saarimäki, Lena Möbus, Antonio Federico, Angela Serra, Dario Greco.
Abstract
Big Data pervades nearly all areas of the life sciences, yet the analysis of large integrated data sets remains a major challenge. Moreover, the field of life sciences is highly fragmented and, consequently, so are its data, knowledge, and standards. This, in turn, makes integrated data analysis and knowledge gathering across sub-fields a demanding task. At the same time, the integration of various research angles and data types is crucial for modelling the complexity of organisms and biological processes in a holistic manner. This is especially true in the context of drug development and chemical safety assessment, where computational methods can help meet the urgent need for fast, effective, and sustainable approaches. Such computational methods, however, require the development of methodologies suited to an integrated and data-centred Big Data view. Here we discuss Knowledge Graphs (KGs) as a solution for a data-centred analysis approach to drug and chemical development and safety assessment. KGs are knowledge bases, data analysis engines, and knowledge discovery systems all in one, which allows them to be used for tasks ranging from simple data retrieval, through meta-analysis, to complex prediction and knowledge discovery. Therefore, KGs have immense potential to advance the data-centred approach and the re-usability and informativity of data. Furthermore, they can improve the power of analysis and the complexity of the modelled processes, all while providing knowledge in a natively human-understandable network data model.
Keywords: Big data; Chemical safety; Data integration; Drug design; Knowledge graph; Toxicology
Year: 2022 PMID: 36147662 PMCID: PMC9464643 DOI: 10.1016/j.csbj.2022.08.061
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 6.155
Table 1. Examples of existing relevant data sources for drug design and safety assessment, with possible insights these data can provide. How these data can be linked to other entity nodes is displayed in Fig. 1.
| Related Node Type | Data Type | Data Source | Possible Insights |
|---|---|---|---|
| COMPOUND | Structure | PubChem | Structural/ descriptive information of compounds |
| COMPOUND | Effects | SIDER | Clinical/ Toxicity/ observable effect of compounds |
| COMPOUND | MOA | GEO | MOA of compounds |
| GENE (Gene Product) | Function | Ensembl | Gene/ Protein Family/ Function Groups |
| GENE (Gene Product) | Interaction | HIPPIE | Protein Interaction |
| GENE (Gene Product) | Regulation | TRRUST | Gene Regulation |
| PHENOTYPE | Clinical | NCBI | Phenotype relationships, comorbidities, descriptions |
| PHENOTYPE | Molecular | GEO | MOA of Phenotypes |
| ASSOCIATIONS | Function & Effect | GO | (Functional) Groups |
| CELL LINE/ TISSUE/ ORGAN | (Molecular) Characteristics | Human Protein Atlas | MOA of biological systems under different conditions |
Fig. 1. Possible high-level schema of a life science KG focused on chemical safety assessment and drug development, outlining different data types and links from a compound-centred perspective. The schema covers data describing a compound's MOA, (observable) effects, structure, and compound-specific metadata. Examples of data sources that can provide these links are listed in Table 1.
Fig. 2. Diverse data sources can be integrated into a unified data model, such as a KG. Through data integration, links hidden in the individual data sources can be made visible. In addition, the KG can be used to generate or infer new knowledge (links) based on existing data.
Fig. 3. Example of graph exploration with the respective Cypher (Neo4j query language) commands. The figure shows two example subsets of a KG (A and B), each with 3 nodes and 2 edges. The grey lines are links that could be inferred from the existing data via exploration of the dashed lines. A) Gene (products) possibly belonging to a specific pathway are inferred through the one-step neighbours of gene (products) known to belong to this pathway. B) A possible drug to treat a certain phenotype is sought, using the knowledge of a gene (product) causing this phenotype together with a drug - gene (product) relationship. Below the figure, example Cypher queries show, in a very simplistic manner, how the graph can be explored and missing links inferred. If the graph contained multiple genes or drugs that fit the criteria outlined in the queries, multiple results would be returned.
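The two exploration patterns described in the Fig. 3 caption can be sketched in plain Python. This is a minimal illustration and not the authors' Cypher queries: the toy triples, the relation names (`PART_OF`, `INTERACTS_WITH`, `CAUSES`, `TARGETS`), and the node names are all hypothetical placeholders chosen for the example.

```python
from collections import defaultdict

# Hypothetical toy triples (subject, relation, object); every name is illustrative.
TRIPLES = [
    ("GENE_A", "PART_OF", "PATHWAY_X"),
    ("GENE_A", "INTERACTS_WITH", "GENE_B"),
    ("GENE_C", "CAUSES", "PHENOTYPE_Y"),
    ("DRUG_D", "TARGETS", "GENE_C"),
]


def build_index(triples):
    """Group (subject, object) pairs by relation for simple pattern matching."""
    index = defaultdict(set)
    for subj, rel, obj in triples:
        index[rel].add((subj, obj))
    return index


def infer_pathway_members(index, pathway):
    """Scenario A: one-step interaction neighbours of known pathway members."""
    known = {gene for gene, pw in index["PART_OF"] if pw == pathway}
    candidates = set()
    for a, b in index["INTERACTS_WITH"]:
        # Interactions are treated as undirected in this sketch.
        if a in known:
            candidates.add(b)
        if b in known:
            candidates.add(a)
    return candidates - known


def infer_drug_candidates(index, phenotype):
    """Scenario B: drugs that target a gene (product) causing the phenotype."""
    causal = {gene for gene, ph in index["CAUSES"] if ph == phenotype}
    return {drug for drug, gene in index["TARGETS"] if gene in causal}


index = build_index(TRIPLES)
pathway_candidates = infer_pathway_members(index, "PATHWAY_X")
drug_candidates = infer_drug_candidates(index, "PHENOTYPE_Y")
print(sorted(pathway_candidates))
print(sorted(drug_candidates))
```

As in the caption, both functions return sets, so a graph with several matching genes or drugs would simply yield several results. The inferred links are candidates only and would still need validation.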
Fig. 4. Schematic representation of compound development and risk assessment. In a data-driven pipeline, only compounds that pass the knowledge-based risk assessment, for example via a KG, are allowed to continue into experiment-based evaluations. This reduces development costs, increases safety, and improves development speed, since only compounds with a high probability of success are allowed to continue. Newly generated data can constantly be fed back into the KG and used to re-evaluate the compounds for the next step in the pipeline. All information gained during the process is added to the KG and can be used for other compounds in the future.
Examples of KGs, their size and integrated data layers.
| Publication | Problem | Number of Data Layers | Data Layers | KG Size |
|---|---|---|---|---|
| Zhang et al. | Prediction of Adverse Drug Reactions | 3 | Drug - Side Effect | 12,473 nodes |
| Al-Saleem et al. | Drug Repositioning for Covid-19 | 11 | Gene - Gene | > 6 M nodes |
| Pavel & del Giudice et al. | Identification of Genes Associated with Covid-19 | 2 | Gene - Gene | 27,892 nodes |
| Wang et al. | Prediction of Drug - Drug Interactions | 5 | Drug - Gene (3 relationship types) | NA |
| Thafar et al. | Prediction of Drug - Target Interactions | 1 | Usage of multiple benchmarking data sets | |
| Mohamed et al. | Prediction of Drug - Target Interactions | 1 | Usage of multiple benchmarking data sets | |
| Abdelaziz et al. | Prediction of Drug - Drug Interactions | At least 6 | Drug - Gene | NA |
| Zhang et al. | Drug repositioning for Covid-19 | At least 15 | Based on subset of SemMedDB | 331,427 nodes |
| Chen et al. | Collection of Clinical Trial data | 21 | Meta data & results of the clinical trials, but not linked to additional information outside of this data | |
Data integration related challenges for different data types possibly needed in a drug and chemical centred KG.
| Data Type | Common Identifiers & Ontologies | Associated Challenges for the Data Integration Task |
|---|---|---|
| Chemicals/Drugs/Compounds | SMILES | Although canonical SMILES are defined, they are not always used in reporting. Instead, simple (non-canonical) SMILES are often reported, which vary depending on where in the compound structure they are started, so multiple SMILES can be created for the same compound. In addition, depending on the features used to compute chemical fingerprints or molecular descriptors, the same fingerprint or descriptor can be computed for compounds that differ in their 3D structure (e.g. through bond rotation). |
| Genes/ Gene products | Entrez | A 1-1 mapping between different identification systems is not always available. |
| Gene Sets | Pathways | Although pathways, for example, are defined on a conceptual level, they cannot be mapped 1-1 between platforms. |
| Clinical Data/Phenotypes | Name | Medical terms are often language dependent, which makes international mapping challenging. In addition, many different “unified” standards have been proposed that use different terms and classes, so a 1-1 mapping does not always exist. Users also have their own preferences as to which naming system to use. |
| Cell lines/ Tissues | Name | There is no agreed standard for reporting cell line or tissue names, and especially for commercial cell lines the names may depend on the producer. |
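The recurring "no 1-1 mapping" problem in the table above can be made concrete with a small check that separates clean from ambiguous identifier mappings before integration. This is only a sketch under stated assumptions: the identifier pairs are made up, and `classify_mappings` is a hypothetical helper, not part of any existing mapping service.

```python
from collections import defaultdict

# Hypothetical mapping pairs between two identifier systems; all IDs are made up.
PAIRS = [
    ("SRC_1", "TGT_1"),
    ("SRC_2", "TGT_2"),
    ("SRC_2", "TGT_3"),  # one source ID maps to two target IDs
    ("SRC_3", "TGT_4"),
    ("SRC_4", "TGT_4"),  # two source IDs map to the same target ID
]


def classify_mappings(pairs):
    """Split source IDs into strict 1-1 mappings and ambiguous ones."""
    forward = defaultdict(set)   # source ID -> target IDs
    backward = defaultdict(set)  # target ID -> source IDs
    for src, tgt in pairs:
        forward[src].add(tgt)
        backward[tgt].add(src)

    one_to_one = {}
    ambiguous = {}
    for src, targets in forward.items():
        only_target = next(iter(targets))
        # A mapping is 1-1 only if it is unique in both directions.
        if len(targets) == 1 and len(backward[only_target]) == 1:
            one_to_one[src] = only_target
        else:
            ambiguous[src] = sorted(targets)
    return one_to_one, ambiguous


one_to_one, ambiguous = classify_mappings(PAIRS)
print(one_to_one)
print(ambiguous)
```

How the ambiguous entries are handled is a design choice. Options include manual curation, keeping all alternatives as separate links in the KG, or excluding them.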