The potential of a data centred approach & knowledge graph data representation in chemical safety and drug design.

Alisa Pavel^1,2,3, Laura A Saarimäki^1,2,3, Lena Möbus^1,2,3, Antonio Federico^1,2,3, Angela Serra^1,2,3, Dario Greco^1,2,3,4.

Abstract

Big Data pervades nearly all areas of the life sciences, yet the analysis of large integrated data sets remains a major challenge. Moreover, the field of life sciences is highly fragmented and, consequently, so are its data, knowledge, and standards. This, in turn, makes integrated data analysis and knowledge gathering across sub-fields a demanding task. At the same time, the integration of various research angles and data types is crucial for modelling the complexity of organisms and biological processes in a holistic manner. This is especially true in the context of drug development and chemical safety assessment, where computational methods can address the urgent need for fast, effective, and sustainable approaches. Such computational methods, in turn, require the development of methodologies suitable for an integrated and data centred Big Data view. Here we discuss Knowledge Graphs (KG) as a solution to a data centred analysis approach for drug and chemical development and safety assessment. KGs are knowledge bases, data analysis engines, and knowledge discovery systems all in one, which allows them to be used for tasks ranging from simple data retrieval through meta-analysis to complex predictive and knowledge discovery systems. Therefore, KGs have immense potential to advance the data centred approach and the re-usability and informativity of data. Furthermore, they can improve the power of analysis and the complexity of modelled processes, all while providing knowledge in a natively human understandable network data model.

Entities: Chemical

Keywords: Big data; Chemical safety; Data integration; Drug design; Knowledge graph; Toxicology

Year: 2022 PMID: 36147662 PMCID: PMC9464643 DOI: 10.1016/j.csbj.2022.08.061

Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 6.155

Introduction

The development of new drugs and chemicals is a long and expensive endeavour [1], [2], [3]. An integral part of this process is the evaluation of the safety and efficacy of new compounds, which relies on tests that are time consuming, costly, and ethically challenging [3], [4]. Therefore, a shift towards alternatives to the traditional assessment of apical endpoints is taking place. Such efforts promote the reduction of animal experimentation and the use of integrated approaches where multiple testing strategies and research angles are combined [5]. At the same time, the efforts to reduce and replace experimental animals pose novel challenges for the evaluation of systemic effects and long-term outcomes of chemical exposures. Regardless of the test system, a comprehensive understanding, modelling, and prediction of organism-level responses requires the integration and analysis of multiple data layers. In this sense, data becomes even more central and valuable, and the computational strategies applied grow increasingly important as large sets of data need to be analysed and integrated. New independent data sets relevant for chemical design and safety assessment are constantly being generated. However, these data sets are often highly scattered, not comparable, and of varying quality. Significant efforts have been made to establish standards for data sharing and management through the FAIR (Findable, Accessible, Interoperable and Reusable) principles [6]; however, they often fall short when large amounts of data need to be integrated [7]. In addition to data accessibility and metadata reporting standards, robust data integration methodologies need to be further developed. These aspects are fundamental for combined data analysis and modelling of complex processes, and for making optimal use of all available data. 
Nonetheless, the data may still not reach its full potential unless it is stored in a structure that enables straightforward integration of various data types and layers. To this end, Knowledge Graphs (KG) present a suitable framework. A KG is a data structure that contains and conveys knowledge about the “real world under investigation”, in which the data is stored in a graph based format [8], [9], [10]. This opens unprecedented possibilities for drug and chemical design and safety assessment as well. KGs are an extension of a knowledge base, to which a reasoning engine is applied to generate and infer new facts about the world [8]. They model structured knowledge in a graph based format and can be used 1) as a knowledge base or database (e.g. the Google search engine), 2) to analyse the data by making use of graph based metrics and methods (e.g. traffic routing systems) and 3) to infer new facts about the world (e.g. the Amazon recommendation engine). The latter application is what distinguishes KGs from classical knowledge bases [8]. The underlying graph model can be a directed, undirected, heterogeneous or property graph structure, which can contain edge and node labels as well as attributes [9], [11]. How such a graph is stored on disk can differ between database management systems. Triple stores are an example of a more specific graph database model: they store everything as an edge, including properties, while other graph database engines store data in different manners, for example as a multigraph or adjacency list [12], [13]. This review does not cover this topic in more detail. While performance differences can be observed between different data management and storage options for specific use cases, the topic is of lower relevance for users wishing to use a ready-made solution, i.e. a database management system. More information can be found in the reviews [12], [13]. 
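The difference between a triple store, where properties are themselves edges, and a property graph kept as an adjacency list can be illustrated with a minimal Python sketch. The entity names, the `has_name` predicate and the function `triples_to_property_graph` are hypothetical and chosen only for illustration; real database engines use far more elaborate storage layouts.

```python
from collections import defaultdict

# Hypothetical toy facts; all identifiers are illustrative only.
triples = [
    ("drugA", "TARGETS", "gene1"),
    ("gene1", "ASSOCIATED_WITH", "phenotypeX"),
    ("drugA", "has_name", "Example compound"),  # a property stored as an edge
]


def triples_to_property_graph(triples, property_predicates):
    """Split triples into node properties and an adjacency list of labelled edges."""
    node_props = defaultdict(dict)
    adjacency = defaultdict(list)
    for subject, predicate, obj in triples:
        if predicate in property_predicates:
            # In a property graph the value is attached to the node itself.
            node_props[subject][predicate] = obj
        else:
            # Everything else remains a labelled edge between two nodes.
            adjacency[subject].append((predicate, obj))
    return dict(node_props), dict(adjacency)


node_props, adjacency = triples_to_property_graph(triples, {"has_name"})
```

Both representations hold the same facts; they differ only in whether a property is reached by following an edge or by reading an attribute of the node.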
The application of KGs and their benefits in the life sciences have been extensively described [14], [15], [16], [17], [18], [19], [20], [21], [22]. However, while the avenues to explore by using KGs are vast and exciting, the limitations and roadblocks need to be addressed as well. This review describes available data sources that can support the drug design and chemical safety assessment process. It reviews currently available as well as possible KG applications for drug design and chemical assessment. Lastly, it discusses how the use of KGs can advance the integration and analysis of the different data layers in the context of chemical and drug development, and address the challenges standing in the way of the full exploitation of these data structures from a data centric view.

Available data sources for chemical safety and drug design

The in silico drug design and chemical safety assessment process relies on knowledge from different areas of the life sciences [5], [16], [23], [24]. For example, the structural information of compounds can be complemented with toxicogenomics data to study their mechanisms of action (MOA), defined as the underlying molecular processes happening in the biological systems under specific conditions [25], [26], [27]. Moreover, this knowledge can be integrated with clinical (trial) data to further explore the compound effects on a large population. This knowledge can be supported by organism specific information, data from systems biology and lab based experimental data. An overview of possible data sources that can be used to address different aspects of the chemical assessment and drug design processes is given in Table 1. KGs can easily support the integration of such heterogeneous data. In Fig. 1, we present a possible high level schema for a KG that, as a whole or at the sub-graph level, can be used for chemical safety assessment and drug design.

Table 1

Related Node Type	Data Type	Data Source	Possible Insights
COMPOUND	Structure	PubChem [28], STITCH [29], ZINC20 [30], QSAR-DB [31]	Structural/ descriptive information of compounds
	Effects	SIDER [32], Pharos [33], DrugCombDB [34], CTD [35], OpenTargets [36], DrugBank [37], Tox21 [38], [39], ECOTOX (cfpub.epa.gov/ecotox/), ToxCast (epa.gov/chemical-research/toxicity-forecasting)	Clinical/ Toxicity/ observable effect of compounds
	MOA	GEO [40], LINCS L1000 [41], CTD [35], TG-Gates [42]	MOA of compounds
GENE (Gene Product)	Function	Ensembl [43], Panther [44], [45]	Gene/ Protein Family/ Function Groups; Organism matching
	Interaction	HIPPIE [46], HitPredict [47], [48], HuRI [49], MINT [50], IntAct [51], String [52]	Protein Interaction
	Regulation	TRRUST [53], [54], TargetScan [55], [56], miRTarBase [57], InnateDB [58]	Gene Regulation
PHENOTYPE	Clinical	NCBI [59] MedGen, NCBI ClinVar [60], DisGeNet [61], Human Phenotype Ontology [62], Orphanet (orpha.net), OMIM (omim.org)	Phenotype relationships, comorbidities, descriptions
PHENOTYPE	Molecular	GEO [40], GWASCatalog [63], ArrayExpress [64], CTD [35]	MOA of Phenotypes
ASSOCIATIONS	Function & Effect	GO [65], [66], MSigDB [67], [68], Reactome [69], Wikipathways [70], KEGG [71], [72], EnrichR [73], AOP-Wiki (aopwiki.org)	(Functional) Groups
CELL LINE/ TISSUE/ ORGAN	(Molecular) Characteristics	Human Protein Atlas [74], GTex (gtexportal.org), GEO [40], ENCODE [75], CellMiner [76], [77]	MOA of biological systems under different conditions

Fig. 1

Possible high-level schema of a life science KG focused on chemical safety assessment and drug development, outlining different data types & links from a compound centred perspective. The schema covers data describing a compound's MOA, (observable) effects, structure and compound specific meta-data. Examples of data sources that can provide these links are listed in Table 1.

Table 1: Examples of existing relevant data sources for drug design and safety assessment with possible insights these data can provide. How these data can be linked to other entity nodes is displayed in Fig. 1.

Many of the data types listed in Table 1 are by nature link-orientated (e.g. protein–protein interactions [46], [52], drug target information [37], [78], [79]) or directly produced or represented in graph structures (e.g. co-expression networks [80], [81], regulation networks [54]) [82]. This natural network representation and link orientation is one of the main advantages of modelling these data in an integrated network fashion. Analysing and modelling biological knowledge in a network structure is a common methodology in systems biology [83], [84], since it allows an entity to be investigated with respect to all other entities in the network and the information flow through a network to be modelled (e.g. in a protein–protein interaction network, a regulation network or a gene co-expression network) [20], [23], [80], [85].

Advantages of data modelling with graph databases & exploration through KGs

The dual nature of KGs as knowledge base and inference engine, combined with the possibility to analyse the data from a network perspective in addition to traditional methods, can bring many advantages to studies on drug development and compound safety assessment, which are outlined below.

KGs as a data modelling system to improve flexibility, re-usability and expandability of data rich studies

Meta-analysis is a common tool to investigate a set of studies in order to gain statistical insight into common research questions and their findings. Here we use meta-analysis as an example to showcase how large scale data management and data integration, as defined in a KG, can improve the quality and re-usability and minimise the cost of such studies. Meta-analysis based studies can be cost-intensive due to the amount of manual work needed to collect and annotate research studies, perform statistical analysis, and interpret the results. With the growing volume of research studies, this problem becomes more challenging [86]. It has been shown that an optimised data representation that allows data scaling and reuse can reduce the technical issues associated with meta-analysis based studies [87]. KGs are data structures that cope well with data of varying quality and type and with gaps in the data. The time spent on the search, extraction and comparison of studies can be substantially reduced through semantic annotation. Furthermore, KGs allow flexible data representation and are easily scalable with new data and scopes. Statistical methods and analysis can be applied directly to the KG [88], allowing the use not only of the data stored in the KG but also of the additional layers of information contained in the graph topology.
For example, Yang et al. [89] conducted a meta-analysis on ecological hazard data to investigate nanoplastic ecotoxicity. Their study made use of data relating particle size to specific observable endpoints, such as population growth, mortality and reproduction. This data is by default link-orientated, making it easy to integrate in a KG model. When expanding the study to include more data sets, additional information or more observable endpoints can easily be retrieved from the KG in a unified format, eliminating the expensive data pre-processing step needed in most meta-analyses [87]. Wang et al. [90] combined multiple gene expression data sets covering the response to a pulmonary tuberculosis infection in order to identify possible therapeutic targets. Gene expression information in response to certain conditions can again easily be integrated into a link oriented data model, while the experimental entity can be enriched with the necessary metadata of the exposure, which can be linked to similar experiments. In addition to extracting the direct information, this also allows similar studies to be identified that could be included in further studies or meta-analyses. In the pharmaceutical industry, model based meta-analyses can be used during the drug development process, which helps to leverage (prior) knowledge in order to make informed decisions about the potential of a compound [91], [92]. Such data could, for example, contain information from previous clinical studies or about possible competing products, which can help to make informed decisions about optimal dosing or to perform a risk assessment of the compound's profitability [92].
Another example of how KGs can aid data centred studies by providing a unified data schema that integrates multiple layers of diverse data is showcased in Federico et al. [23]. In their study, the authors exploited multiple data types that were unified in their custom KG [20] for a drug repositioning study focused on the prioritisation of drug combinations for the treatment of human complex diseases. Since the molecular targets of drugs are soluble proteins and/or receptors, a co-expression network of the disease was filtered by using multiple data sets (protein–protein interactions, functional relationships in biological pathways and regulatory interactions) integrated in their KG, retaining only edges (and nodes) of the co-expression network supported by these data. In this way, they were able to leverage the biological significance of the disease co-expression network, which refined the predicted drug combinations by exploiting existing molecular knowledge.
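The idea of retaining only those co-expression edges that are supported by other data layers can be sketched in a few lines of Python. This is an illustrative assumption of how such a filter might look, not the procedure used by Federico et al. [23]; the gene names, the layer names and the `min_layers` parameter are hypothetical.

```python
def normalise(edge):
    """Treat edges as undirected by sorting the two endpoints."""
    a, b = edge
    return (a, b) if a <= b else (b, a)


def filter_by_support(coexpression_edges, evidence_layers, min_layers=1):
    """Keep co-expression edges found in at least `min_layers` evidence layers."""
    layers = [{normalise(e) for e in layer} for layer in evidence_layers.values()]
    kept = []
    for edge in coexpression_edges:
        key = normalise(edge)
        # Count in how many independent layers this gene pair also appears.
        support = sum(key in layer for layer in layers)
        if support >= min_layers:
            kept.append(key)
    return kept


# Hypothetical toy data: three co-expression edges and three evidence layers.
coexpression = [("g1", "g2"), ("g2", "g3"), ("g3", "g4")]
evidence = {
    "ppi": [("g2", "g1")],
    "pathway": [("g1", "g2"), ("g3", "g4")],
    "regulation": [],
}
supported = filter_by_support(coexpression, evidence)
strongly_supported = filter_by_support(coexpression, evidence, min_layers=2)
```

Raising `min_layers` trades network coverage for confidence in the remaining edges; which threshold is appropriate depends on the completeness of the evidence layers.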

Flexibility of a graph based data model & the leveraging of hidden links

While many relevant data sources (Table 1) already come in a network based or link oriented data format, they are still individual and independent data sources that need to be integrated into a combined data model. KGs (graph data models) have shown the potential to be a successful framework for the integration of diverse data sets [93]. Effective integration and analysis of the comprehensive data sources could significantly increase the success rate in drug design and chemical safety assessment [92], [94], [95], [96]. Zhang et al. [14] integrated drug - side effect, drug - indication and drug - target information to predict drug - adverse outcome relationships in a KG framework. Al-Saleem et al. [16] created the CAS Biomedical Knowledge Graph, which integrates multiple data sources across 11 different data layers focusing on COVID-19 relevant information, in order to use the KG framework for COVID-19 drug repositioning studies. However, while there is a general consensus that more and diverse data can provide a more complete view [97], [98] on complex biological processes [99], many of the individual data sources follow different standards and were produced for different problems, and therefore do not always contain the same data points or are incomplete. Graph databases and KG technology offer a good solution to this challenging integration task [93]. They are by nature schema-free and allow the integration of different types of data with different levels of quality and completeness [9], as well as the integration of hierarchical dependencies between data points, making the data model easily expandable and adjustable to changes over time. In comparison to relational databases, the schema-free nature implies that the database schema does not need to be defined in advance and can therefore evolve over time with the data. However, this also implies that the user is responsible for keeping the data “clean”, i.e. for assigning the same node/ edge types to data points of the same class and using the same property types for the same data properties, and for understanding that while the data model allows gaps in the data, these gaps will still affect any downstream models applied.
Additionally, graph structures are suitable models for biological systems, making it intuitive to understand their complex organisation. Network structures, especially their connections, can be easily visualised and explored [20], [80], [100], [101], [102], [103], [104]. Moreover, the extraction and analysis of sub-graphs can provide a more informative view of the process under study. Pavel & del Giudice et al. [20] used a sub-area of the gene (product) interaction data layer in their KG infrastructure to analyse possible molecular processes associated with COVID-19. Serra et al. [105] used a network based approach to perform engineered nanomaterial (ENM) contextualisation. In their constructed network, the nodes represent four types of entities (ENMs, chemical exposures, drug treatments and human diseases), while the edges represent the similarity between entities based on their induced transcriptional alteration. The network was scanned in search of heterogeneous cliques of four nodes (one ENM, one chemical, one drug and one human disease) in order to contextualise the effect of ENM exposure with respect to the other entities. This analysis highlighted strong connections between metal oxide nanoparticles and neurodegenerative disorders. Ratajczak et al. [106] showcased that filtering KGs to contain only task relevant information can lead to significant improvements in prediction performance. The authors reached an improvement of up to 40 % when predicting possible drug targets via graph embedding.
The use of property graphs allows not only edge centred data to be added to the KG, but also nodes and edges to be enriched with properties, which can be unique to a specific data point. This allows the easy integration of, for example, quantitative data, such as age or gene expression counts. To make these data comparable in a graph model, they can be assigned to classes, such as child, adolescent and adult, or low, middle and high gene expression, which can be added to the graph model as their own nodes. For an example of how to classify gene expression counts, see the “Discriminant Fuzzy Pattern to Filter Differentially Expressed Genes” method [107]. By adding higher level classes of these terms as nodes, the graph topology can be used to gather further insights, while the properties of the data point can be retrieved when the exact individual terms/ values are needed, making this data model highly versatile in application. In addition, KGs can be queried efficiently with specialised, pattern orientated graph query languages, allowing a detailed exploration of linked data (Fig. 2) and of graph topologies; the latter is information not available in other data representation formats.
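The combination of exact values stored as properties and coarse classes stored as nodes can be sketched as follows. The fixed cutoffs are placeholders chosen for illustration and are not the fuzzy pattern method of [107]; the node and edge labels (`Measurement`, `IN_CLASS`, `OF_GENE`) and all sample and gene names are likewise hypothetical.

```python
def expression_class(count, low_cutoff=10, high_cutoff=100):
    """Map a raw count to a coarse class; the cutoffs are placeholders."""
    if count < low_cutoff:
        return "low"
    if count < high_cutoff:
        return "middle"
    return "high"


def add_measurement(graph, sample, gene, count):
    """Store the exact count as a node property and link the node to its class."""
    measurement = f"{sample}|{gene}"
    class_node = f"expression:{expression_class(count)}"
    graph["nodes"][measurement] = {"type": "Measurement", "count": count}
    graph["nodes"].setdefault(class_node, {"type": "ExpressionClass"})
    graph["nodes"].setdefault(gene, {"type": "Gene"})
    graph["edges"].append((measurement, "OF_GENE", gene))
    graph["edges"].append((measurement, "IN_CLASS", class_node))
    return measurement


def members_of(graph, class_node):
    """Topological lookup: all measurements linked to a given class node."""
    return sorted(
        src for src, rel, dst in graph["edges"]
        if rel == "IN_CLASS" and dst == class_node
    )


graph = {"nodes": {}, "edges": []}
add_measurement(graph, "sample1", "geneA", 250)
add_measurement(graph, "sample2", "geneA", 4)
```

Queries that only need the class can traverse the `IN_CLASS` edges, while the exact count remains available as a property of the measurement node.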

Fig. 2

Diverse data sources can be integrated into a unified data model, such as a KG. Through data integration, hidden links from the individual data sources can be made visible. In addition, the KG can be used to generate/ infer new knowledge (links) based on existing data.

Leveraging the topology of a KG

One main advantage of the integration of many data layers in a single KG is the possibility to retrieve the so-called “hidden links”, which are relationships, associations and correlations that are contained indirectly in the data but not visible in the raw data without the additional topological information (Fig. 2). These hidden links can easily be spotted in a graph based format, but are difficult to investigate in a relational data format. For example, by integrating knowledge about gene product interactions (e.g. protein–protein interaction networks) with drug - target information as well as gene product - phenotype information, drug - phenotype links can be retrieved directly from the network even though this information is not directly contained in any of the integrated datasets. Fig. 3 shows how such links can be explored. In addition, classical topological network metrics can be used to evaluate entities. Examples of such metrics are degree centrality, closeness centrality and edge betweenness centrality [88], [108].
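A minimal Python sketch of the hidden link example above: three separate toy data layers are combined so that drug - phenotype pairs emerge that none of the layers contains on its own, and a normalised degree centrality is computed on the gene interaction layer. All identifiers and the function names are hypothetical, and treating the direct interactors of a target as relevant is an assumption made for illustration.

```python
from collections import defaultdict

# Three independent toy layers; none of them links drugs to phenotypes directly.
drug_targets = {"drugA": {"gene1"}, "drugB": {"gene3"}}
gene_interactions = {("gene1", "gene2"), ("gene2", "gene3")}
gene_phenotypes = {"gene2": {"phenotypeX"}, "gene3": {"phenotypeY"}}


def infer_drug_phenotype_links(drug_targets, gene_interactions, gene_phenotypes):
    """Follow drug -> target (-> interactor) -> phenotype paths."""
    partners = defaultdict(set)
    for a, b in gene_interactions:
        partners[a].add(b)
        partners[b].add(a)
    links = set()
    for drug, targets in drug_targets.items():
        for target in targets:
            # Consider the target itself and its one step interaction partners.
            for gene in {target} | partners[target]:
                for phenotype in gene_phenotypes.get(gene, ()):
                    links.add((drug, phenotype))
    return links


def degree_centrality(edges):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    n = len(degree)
    return {node: d / (n - 1) for node, d in degree.items()} if n > 1 else {}


hidden_links = infer_drug_phenotype_links(drug_targets, gene_interactions, gene_phenotypes)
centrality = degree_centrality(gene_interactions)
```

In this toy graph `gene2` interacts with both other genes and therefore receives the highest degree centrality.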

Fig. 3

Example of graph exploration with respective Cypher (Neo4j query language) commands. The figure shows two example subsets of a KG (A and B), each with 3 nodes and 2 edges. The grey lines are links that could be inferred from the existing data via exploration of the dashed lines. A) Gene (products) possibly belonging to a specific pathway are inferred through the one step neighbours of gene (products) known to belong to this pathway. B) A possible drug to treat a certain phenotype is sought via the knowledge of a gene (product) causing this phenotype as well as a drug - gene (product) relationship. Below the figure, examples of Cypher queries are shown, which illustrate in a very simplistic manner how the graph can be explored and missing links can be inferred. If the graph contained multiple genes or drugs that fit the criteria outlined in the queries, multiple results would be returned.

Pavel, del Giudice et al. [20] leveraged shortest paths to identify genes that link known gene sets associated with COVID-19, in order to identify possible genes that are associated with the disease but are neither direct interactors of the virus nor measurable in differential expression analysis. With the help of the applied topological exploration, the authors were able to identify a set of intermediate genes and link them to relevant biological processes, such as vascular processes. Through the additional integration of drug - gene target information in their KG model, the authors were able to suggest possible drug repositioning candidates based on the identified gene sets. Zhu et al. [109] constructed a drug KG, which they used to explore possible drug repositioning candidates. In addition to an embedding based approach, they also explored paths connecting diseases with drugs in order to extract the connectivity information between a drug - disease pair. In their study investigating the mechanism of action of engineered nanomaterials in vivo and in vitro, Kinaret et al. [110] showed that by exploring the expression profiles via gene co-expression networks [85] and the functional groups contained in them (communities) [108], the in vivo & in vitro functional responses converged, which was not observable when comparing the differentially expressed genes directly. Madi et al. [111] built an antigen-antigen correlation network from antigen microarray data and, by extracting its minimum spanning tree, created immune trees in order to compare these between mothers and their newborns. Pavel et al. [85] compared the mechanism of action of dasatinib and mitoxantrone via topological properties of gene co-expression networks. In Federico et al. [23], the authors prioritised potentially relevant drugs by considering the MOA of drugs, their structure and topological properties of the disease network. Drug combinations are prioritised based on having “long” shortest paths between their targets on the created cancer co-expression network, so as to target non-overlapping areas. In addition, drugs that target central genes in terms of degree centrality in the cancer co-expression network are prioritised. The rationale is that by targeting genes that are central in the network, it is possible to indirectly expand the effect of the drug to the widest area of the network. The selected drug combinations thus target genes that show high connectivity in the cancer network and cover the widest area of the network, maximising the therapeutic effect of the combination while minimising the functional overlap of the drugs.
Topological information of the graph can be used on a local level to assess the quality of knowledge of individual entities or whole subgraphs. For example, similar entities can be compared based on their connectivity profile to evaluate the quality of individual relationships [112], or individual relationships can be scored on their likelihood of being true, based on topologically close entities as well as on connections to similar node entities in the graph. The same principle could be applied to assess the correctness of node or edge labels or to add possible correct labels [113]. The underlying assumption is that similar entities should be connected to similar other entities. This idea is explored in, for example, network matching algorithms [114] as well as in node embedding algorithms, such as node2vec [115], which leverages random walks to translate the graph space into a vector space in which closely or similarly connected nodes are placed near each other.
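The shortest path exploration described above, in which intermediate genes linking two known gene sets are collected, can be sketched with a breadth-first search. This is a simplified illustration under stated assumptions and not the implementation of [20] or [23]: the graph is unweighted, only one shortest path per gene pair is followed, and the toy adjacency list and function names are hypothetical.

```python
from collections import deque


def shortest_path(adjacency, source, target):
    """Return one shortest path from source to target (BFS), or None."""
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk back through the predecessors to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in adjacency.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


def intermediate_genes(adjacency, set_a, set_b):
    """Collect genes lying on shortest paths between two known gene sets."""
    found = set()
    for a in set_a:
        for b in set_b:
            path = shortest_path(adjacency, a, b)
            if path:
                found.update(path[1:-1])
    return found - set(set_a) - set(set_b)


# Hypothetical toy interaction network; g5 is disconnected.
toy_network = {
    "g1": ["g2"],
    "g2": ["g1", "g3"],
    "g3": ["g2", "g4"],
    "g4": ["g3"],
    "g5": [],
}
linkers = intermediate_genes(toy_network, {"g1"}, {"g4"})
```

The length of the returned path (number of nodes minus one) is the kind of distance that could be used to favour drug pairs whose targets lie far apart in the network.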

Making the graph space interpretable by classical machine learning algorithms through node embedding

Many current approaches that use KGs to gain new insight into biological processes are based on node embedding methodologies [14], [17], [21], [22], [116], such as node2vec [115], in combination with a classification algorithm, such as logistic regression-based classifiers, to solve the link prediction problem present in a KG (a new fact about the world under investigation translates into a new edge in the KG, which reduces most prediction problems on a KG to a link prediction problem). Embedding based methods have the advantage that they translate the graph into a vector space, making it suitable for the application of existing prediction/ classification models. Zhang et al. [14] made use of a KG and its custom node embedding, based on the word2vec algorithm [117], to link drugs with their potential adverse drug reactions, based on a logistic regression classifier applied to the vectorized node embeddings. Karim et al. [22] proposed a framework leveraging KG embedding methodologies to predict possible interactions between drugs, Myklebust et al. [118] assessed the ecotoxicological effect of chemicals via KG embedding and Mohamed et al. [119] predicted possible drug targets via KG embeddings.
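Random walk based embedding methods first turn the graph into a corpus of node sequences, which a word2vec style model then maps to vectors. The sketch below covers only this first step, using uniform random walks; node2vec [115] additionally biases the walks with return and in-out parameters, which are omitted here. The toy graph, the parameter defaults and the function name are assumptions for illustration.

```python
import random


def random_walks(adjacency, walk_length=5, walks_per_node=2, seed=0):
    """Sample uniform random walks that start from every node of the graph."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    walks = []
    for start in sorted(adjacency):
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                options = adjacency.get(walk[-1], [])
                if not options:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(options))
            walks.append(walk)
    return walks


# Hypothetical toy KG with drug, gene and phenotype nodes (undirected).
toy_graph = {
    "drugA": ["gene1"],
    "gene1": ["drugA", "gene2"],
    "gene2": ["gene1", "phenotypeX"],
    "phenotypeX": ["gene2"],
}
walks = random_walks(toy_graph)
```

Each walk plays the role of a sentence: nodes that frequently co-occur within walks end up with similar vectors, and features derived from pairs of such vectors can then be passed to a classifier for link prediction.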

Example applications of KGs in drug development and safety assessment

KGs are a valuable instrument that facilitates the integration of multi-source and heterogeneous data, and they provide an unprecedented opportunity to gain new knowledge to guide de novo drug design. However, disentangling and understanding data of such high complexity and diversity is perhaps the biggest challenge of big data exploitation. A schematic representation of how KGs can be applied to aid clinical trials during compound development and risk assessment is displayed in Fig. 4, while multiple examples of KGs and the knowledge gained from them in different areas of toxicology and chemical/ drug development are outlined in this section.

Fig. 4

Schematic representation for compound development and risk assessment. In a data-driven pipeline only compounds that pass the knowledge-based risk assessment, for example via a KG, are allowed to continue into experimental based evaluations. This reduces development costs, increases safety and improves development speed since only compounds with a high probability of success are allowed to continue. New data generated can constantly be re-fed into the KG and used to re-evaluate the compounds for the next step in the pipeline. All information gained during the process is added to the KG and can be used for other compounds in the future.

Drug adverse outcome & drug target predictions

Prediction models to link chemicals or drugs with their possible phenotypic outcomes, such as possible side effects/ adverse reactions, have been developed in a KG framework [14], [15]. Often the prediction of adverse drug reactions is carried out by considering one data layer at a time, such as the chemical structure, ADME (absorption, distribution, metabolism, excretion), or the molecular targets of a drug. KGs give the opportunity to investigate a drug or a set of drugs over multiple data layers at the same time, in a combined data model and analysis framework. These approaches aid the drawing of connections among drugs, relying on more robust predictions that are based on a larger number of characteristics than in the past. Zhang et al. [14] constructed a KG comprising drug, indication, target and side effect (adverse outcome) nodes, and three relationship types between them (has side effect, has target and has indication). Through node embedding and the application of a classifier they aimed to link drugs with possible adverse outcomes. They tested their model on a dataset of 862 FDA approved drugs annotated with whether or not the drug carries a risk of inducing liver injury. While they were not able to infer the severity of the risk of liver injury, their model was able to discriminate between drugs that do not induce liver injury and those that can. One of the crucial steps in drug development is the identification of drug targets. Predicting the ability of a certain compound to interact with a molecular target and the effect that the compound has on it is a challenging task, which can be simplified and sped up through the application of KGs and thorough integration of large scale and diverse data layers. For example, Thafar et al. [120] developed a computational method, called DTiGEMS+, that predicts drug-target interactions by combining graph embedding, by means of the node2vec algorithm [115], with machine learning classifiers such as artificial multilayer perceptrons [121], random forests [122] and adaptive boosting [123].

Predicting drug-drug interactions

Predicting chemical-chemical or drug-drug interactions in silico can reduce development costs significantly as well as improve compound safety [22]. Different KG based frameworks to predict drug-drug interactions have been proposed recently [21], [22], [124]. Both Wang et al. [21] and Karim et al. [22] propose a framework leveraging KG embedding methodologies to predict possible interactions between drugs, while Abdelaziz et al. [124] feed similarity metrics computed on drug information and KG structure into a logistic regression model to identify potential drug-drug interactions.
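A similarity based feature vector for a drug pair can be sketched as follows. The text above does not specify which similarity measures are used in [124]; the Jaccard index, the two feature layers and all drug, gene and side effect names below are assumptions chosen for illustration.

```python
def jaccard(a, b):
    """Jaccard index of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical per-drug annotations, e.g. read from the neighbours in a KG.
drug_features = {
    "drugA": {"targets": {"gene1", "gene2"}, "side_effects": {"nausea"}},
    "drugB": {"targets": {"gene2", "gene3"}, "side_effects": {"nausea", "headache"}},
}


def pair_features(drug_x, drug_y, features, layers=("targets", "side_effects")):
    """One similarity value per data layer for a pair of drugs."""
    return [jaccard(features[drug_x][layer], features[drug_y][layer]) for layer in layers]


vector = pair_features("drugA", "drugB", drug_features)
```

Vectors of this kind, computed for drug pairs with known interaction status, could serve as training input for a classifier such as logistic regression.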

Drug repositioning

One of the fastest and most cost-efficient ways to treat existing or new diseases is drug repositioning, where already approved or existing drugs are applied to other conditions and therefore require only a fraction of the assessment and approval steps that novel compounds would. However, testing multiple compounds for a set of conditions in an in vivo or clinical setting is not feasible, so the compounds need to be narrowed down to likely successful candidates. Data integration and the application of KGs can provide a feasible computational infrastructure for large scale detection of drug repositioning candidates [16], [17], [18]. Such KG based frameworks have recently gained a lot of attention for their possible application to newly emerging diseases where a rapid response is required, such as COVID-19 [16], [20], [125], as well as for drug target prediction [116] and for closing the genotype-phenotype gap [126].
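One simple way to narrow down candidates in a KG is to count the paths that connect a drug to a disease. The sketch below ranks drugs by the number of drug-gene-disease paths of length two. All identifiers are placeholders, and the scoring rule is an assumption chosen for brevity; the cited frameworks rely on richer models such as node embeddings.

```python
# Hypothetical edges; identifiers are placeholders, not real associations.
DRUG_TARGETS = {
    "drug:A": {"G1", "G2"},
    "drug:B": {"G2", "G3", "G4"},
    "drug:C": {"G5"},
}
DISEASE_GENES = {"disease:X": {"G2", "G3", "G6"}}

def rank_candidates(disease, drug_targets, disease_genes):
    """Rank drugs by the number of drug -> gene -> disease paths of length two."""
    genes = disease_genes[disease]
    scores = {drug: len(targets & genes) for drug, targets in drug_targets.items()}
    # Highest score first; ties are broken alphabetically for reproducibility.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(drug, score) for drug, score in ranked if score > 0]

candidates = rank_candidates("disease:X", DRUG_TARGETS, DISEASE_GENES)
```

Drugs without any connecting path are dropped from the ranking, which mirrors the goal of reducing the set of compounds that would need experimental follow-up.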

Chemical risk assessment

Assessing a compound's toxicity in silico, be it towards the environment or towards organisms, can significantly reduce the cost and time that would otherwise be invested in in vivo or in vitro studies. In addition, possibly toxic compounds can be discarded, if needed, during the early development process instead of during later stage testing. KGs provide a framework that allows the fast screening and assessment of compounds with respect to their possible toxic effects on the environment or organisms, and that can supply the necessary background information on specific compounds [118], [127], [128]. Myklebust et al. [127] created the TERA KG to assess chemical toxicity via node embedding. TERA combines chemical information from 3 data sources and toxicity information from ECOTOX (https://cfpub.epa.gov/ecotox/index.cfm) with taxonomy data from 2 data sources. To circumvent the entity mapping challenge (see next section), Myklebust et al. [127] used the Wikidata mapping engine (wikidata.org). They evaluated 9 different node embedding models on TERA to show the improvement that node embedding can bring to the prediction accuracy of neural networks. Zheng et al. [128] showcased the usage of KGs as an integrated data source, where data from unstructured documents were collected through a deep learning based entity recognition system, with the goal of creating a unified system containing information about the effective risk management of hazardous chemicals.

Biological drugs

Biological drugs or biologics are products of living organisms, or contain parts of living organisms, such as recombinant proteins, mRNA-based vaccines, blood components, cells, antibodies, etc. The development of biological drugs has substantially increased in recent years, since they offer many advantages compared to small molecules, especially with regard to their high target specificity [129], [130]. Although the KG framework has to date been only marginally exploited in the R&D of biological drugs, there is no specific technical challenge preventing data on biological drug properties from being integrated into a KG. Interactions between biologics (e.g. peptides, antibodies or viral nanoparticles) and other already discussed entities [131], such as cells or gene products, can be modelled natively in a KG because these data are relationship focused. The same applies to associations of these biologics with phenotypes, to their ability to bind certain (chemical) compounds when used as carriers [132], and to attributes describing their 3D/2D structure or makeup. Such a KG can be used to design biologics with desired binding capabilities, with respect to their target destination, their binding compound, or both.

Possible KG application for the toxicological definition of point of departure

While, to our knowledge, there have been no efforts to date to investigate the possibility of KG models for time and/or dose-dependent predictions, such as the identification of safe doses for novel compounds, KG and big data models in combination with experimental data could be promising. Under the assumption that compounds with similar chemical characteristics would exert the same effects, KGs can be exploited to predict effective doses of a new compound for specific experimental conditions. This could be achieved by applying a read-across based approach, where knowledge from structurally similar compounds in the KG is used to infer possible behaviour for an unknown compound. This can be useful to speed up the initial phases of chemical development and increase the success rate of the process (Fig. 4). On the other hand, when dose-dependent modelling of transcriptomic experiments is performed [133], [134], [135], a list of dose-dependent genes with effective doses is identified. KGs can be used to further enrich functional information about these genes, and their interaction with specific chemical structures or target information. Moreover, subgraphs contained in the KG could be used to identify or compare compounds with similar dose-dependent alteration profiles in order to categorise and characterise their effectiveness.
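The read-across idea described above can be sketched as a similarity-weighted nearest-neighbour estimate. This is a minimal illustration under several assumptions: fingerprints are represented as sets of on-bit indices, similarity is measured with the Tanimoto coefficient, and the reference compounds, doses, neighbour count and similarity cut-off are all hypothetical values.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = fp_a | fp_b
    return len(fp_a & fp_b) / len(union) if union else 0.0

def read_across_dose(query_fp, reference, k=2, min_similarity=0.3):
    """Similarity-weighted mean dose over the k most similar reference compounds.

    Returns None when no reference compound is similar enough, i.e. the query
    falls outside the (assumed) applicability domain.
    """
    scored = sorted(
        ((tanimoto(query_fp, fp), dose) for fp, dose in reference.values()),
        reverse=True,
    )
    neighbours = [(s, d) for s, d in scored[:k] if s >= min_similarity]
    if not neighbours:
        return None
    total_weight = sum(s for s, _ in neighbours)
    return sum(s * d for s, d in neighbours) / total_weight

# Hypothetical reference set: name -> (fingerprint bits, effective dose in mg/kg).
REFERENCE = {
    "cmpd1": ({1, 2, 3, 4}, 10.0),
    "cmpd2": ({1, 2, 3, 5}, 20.0),
    "cmpd3": ({7, 8, 9}, 500.0),
}
estimate = read_across_dose({1, 2, 3}, REFERENCE)
outside_domain = read_across_dose({42}, REFERENCE)
```

Returning no estimate for a query without sufficiently similar neighbours is a deliberate design choice: a read-across prediction is only as trustworthy as the similarity assumption it rests on.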

Clinical trials

During clinical trials, large amounts of data are gathered that need to be processed and ideally managed in a way that makes them available for future studies (whether lab based or in silico). By exploiting the data model and database side of KGs, these data can be integrated into a KG for easy access and use, and findings can be linked to other data, which can be of the same type (e.g., to assess the frequency or quality of the results) or of a different type (to link them to other types of knowledge). Chen et al. [136] proposed the Clinical Trials KG to combine information about different clinical trials, such as the drugs and conditions studied, and evaluated its suitability for drug repositioning (via node embedding) and for the identification of similar medical entities (e.g. to find studies similar to a given study). By combining the Clinical Trials KG with some of the previously mentioned KGs, for example those containing drug (structural) information, phenotypic information or information about a compound’s MOA, we believe that, for example, the linkage of chemical sub-structures to clinical trial outcomes could become possible. By analysing successful trials of similar compounds or phenotypes, the KG could suggest suitable clinical trial set-ups, in addition to leveraging the knowledge gathered during the analysis and evaluation of the study. Such a possible workflow of gathering, creating and analysing data during all steps of a compound's development is represented in Fig. 3.

Challenges associated with KG in drug development and chemical safety assessment

The previously mentioned examples showcase the effectiveness of KGs in chemical risk assessment and drug development; however, they may not yet have achieved their full potential. Many of the introduced KGs are constructed from a limited number of data sources and data layers (Table 2) and are problem specific. While context specific KGs are easier to construct and leverage, their re-usability in other problem domains is limited. The limited number of data sources and data types further introduces context and data specific biases into the KG and, as a result, into its analysis. This section outlines multiple challenges associated with KGs, especially in the drug development and chemical safety assessment domain, that limit the potential and growth of current KG systems.

Table 2

Examples of KGs, their size and integrated data layers.

Publication	Problem	Number of Data Layers	Data Layers	KG Size
Zhang et al. [14]	Prediction of Adverse Drug Reactions	3	Drug - Side Effect; Drug - Target; Drug - Indication	12,473 nodes; 154,239 relationships
Al-Saleem et al. [16]	Drug Repositioning for Covid-19	11	Gene - Gene; Gene - Virus; Gene - Disease; Gene - Biological Process; Gene - Pathway; Gene - Molecular Function; Gene - Small Molecule; Small Molecule - Side Effect; Small Molecule - Clinical Trial; Clinical Trial - Virus; Clinical Trial - Disease	> 6 M nodes; > 18 M edges
Pavel & del Giudice et al. [20]	Identification of Genes Associated with Covid-19	2	Gene - Gene; Gene - Drug	27,892 nodes; 5,964,612 edges
Wang et al. [21]	Prediction of Drug - Drug Interactions	5	Drug - Gene (3 relationship types); Gene - Pathway; Pathway - Phenotype	NA
Thafar et al. [120]	Prediction of Drug - Target Interactions	1	Usage of multiple benchmarking data sets [137] containing Drug - Gene relationships
Mohamed et al. [116]	Prediction of Drug - Target Interactions	1	Usage of multiple benchmarking data sets [137], [138] & a KEGG [72] based one; containing Drug - Gene relationships
Abdelaziz et al. [124]	Prediction of Drug - Drug Interactions	At least 6	Drug - Gene; Drug - Disease; Gene - Gene; Gene - Disease; Chemical - Pathway; Gene - Function	NA
Zhang et al. [125]	Drug repositioning for Covid-19	At least 15	Based on subset of SemMedDB [139] & CORD-19 [140]	331,427 nodes; 20,017,236 edges
Chen et al. [136]	Collection of Clinical Trial data	21	Meta data & results of the clinical trials but not linked to additional information outside of this data

Lack of standards on data management and reporting

Successful large scale data integration depends heavily on the standardisation of individual data sets of the same data type, on detailed metadata reporting, and on the accessibility of the data through computational means (e.g. APIs, computationally processable reporting formats). Many of the available data sets have been generated independently and for different purposes, and therefore vary greatly with respect to their quality, data points, data identifiers and reported metadata, making it challenging to compare and integrate them. While FAIR is a start in introducing standardisation and re-usability of produced data, it has recently been criticised for lacking quality standardisation [7]. In addition, it is mainly aimed at individual data sets and not at the large scale integration of multiple data sets, which would additionally require guidelines for naming and identification standards, especially across sub-disciplines.

Diversity of standards and ID systems

The biological research field is by tradition highly fractured: naming standards, processes and protocols differ substantially between sub-disciplines [98], [141], which makes large scale integration of multiple data sources, especially from different sub-disciplines, highly challenging. The basis for solving this problem is not an algorithmic challenge but a semantic one: common naming standards and ID systems need to be established across the different sub-disciplines, and extensive computational mapping systems should be made publicly available. In the context of drug development and chemical assessment, many different data layers are affected by this problem; some examples are outlined in Table 3. These data reporting and identification challenges are among the main reasons why large scale, problem unspecific KGs have not yet been developed. Instead, most existing KGs are application specific, built from few data layers and data sources, and not aimed at re-use, as shown in Table 2. This suggests that the lack of semantic definitions and of agreement between agencies (e.g. NCBI vs Ensembl) and sub-disciplines has a long lasting impact on the ability of the life sciences to leverage Big Data. However, while all life science research fields would ideally agree on a single semantic database, this is likely unrealistic. How, for example, could one globally unify language dependent differences, appoint a single authority that makes decisions for everyone (across sub-disciplines), and ensure that such a consortium has unlimited funding and the authority needed to enforce such a semantic database? Through data integration strategies and (manual) data mapping, it is possible to identify most of the shared entities across data sets and to create links between knowledge from different data domains. However, the emphasis is on most: researchers need to accept that, while data integration will provide more data and knowledge, parts of individual data sets may become “unusable” (e.g., because they cannot be mapped to other data source identifiers) and that automatic entity mapping systems will introduce errors.
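The consequence of imperfect identifier mapping can be made explicit during integration. The sketch below splits source identifiers into uniquely mapped, ambiguous (one-to-many) and unmapped groups, so that the share of data lost or requiring manual curation is reported rather than silently dropped. The mapping table and identifiers are placeholders and do not correspond to real Entrez or Ensembl entries.

```python
# Hypothetical mapping table between two gene identifier systems.
# The identifiers are placeholders, not real Entrez or Ensembl IDs.
ID_MAP = {
    "src:1": ["tgt:A"],
    "src:2": ["tgt:B", "tgt:C"],  # one-to-many: no unique target
    "src:3": [],                  # known source ID without a counterpart
}

def map_identifiers(source_ids, id_map):
    """Split source IDs into uniquely mapped, ambiguous and unmapped groups."""
    mapped, ambiguous, unmapped = {}, {}, []
    for sid in source_ids:
        targets = id_map.get(sid, [])
        if len(targets) == 1:
            mapped[sid] = targets[0]
        elif targets:
            ambiguous[sid] = targets
        else:
            unmapped.append(sid)
    return mapped, ambiguous, unmapped

source_ids = ["src:1", "src:2", "src:3", "src:4"]
mapped, ambiguous, unmapped = map_identifiers(source_ids, ID_MAP)
# Fraction of entities usable without manual curation.
coverage = len(mapped) / len(source_ids)
```

Tracking the ambiguous and unmapped groups separately makes it possible to decide per data source whether manual mapping is worth the effort.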

Table 3

Data integration related challenges for different data types possibly needed in a drug and chemical centred KG.

Data Type	Common Identifiers & Ontologies	Associated Challenges for the Data Integration Task
Chemicals/Drugs/Compounds	SMILES; Canonical SMILES; Fingerprints; Molecular descriptors; InChIKey; Brand or company; Active principle; Name; The Drug Ontology (https://purl.obolibrary.org/obo/dron.owl)	While canonical SMILES are defined, they are not always used in reporting. Instead, plain (non-canonical) SMILES are often used, which change depending on where in the compound structure the string is started. Therefore multiple SMILES can be created for the same compound. Depending on the features used to compute chemical fingerprints or molecular descriptors, the same fingerprint or descriptor can be computed for compounds that vary in their 3D structure (e.g. through bond rotation). Drug names are often brand and language dependent, therefore yielding different names for the same compound.
Genes/ Gene products	Entrez; Ensembl; Gene symbols; Location; Protein ID; Probe ID	Between different identification systems there is not always a 1-1 mapping available. In addition, different platforms use different underlying algorithms to detect possible genes, making them vary in location, identification and even in what is considered a gene.
Gene Sets	Pathways; Disease Associations; AOPs; GO	Even though pathways, for example, are defined on a conceptual level, they are not 1-1 mappable between platforms. Pathways or GO terms that are considered the same may not always have the same gene sets associated with them. Key Events within an AOP are created manually, which introduces human error, such as duplicated Key Events due to differences in describing or naming the underlying event.
Clinical Data/Phenotypes	Name; Description; Ontology of Adverse Events [144]; ICD; UMLS; OMIM; MeSH; Orphanet Rare Disease Ontology [145]; LOINC; OMOP	Medical terms are often language dependent, making an international mapping challenging. In addition, many different “unified” standards have been proposed, which use different terms and classes, so that a 1-1 mapping does not always exist, and every user will have their own preferences as to which naming system to use. For medical professionals, the patient is at the centre, not the reporting of insights in a computationally readable and processable format for re-use. Even if electronic health records are used, their standards vary across disciplines, borders and institutes. In addition, their main purpose is to record a patient's health (or a specified study), not to provide large scale, integrable research data.
Cell lines / Tissues	Name; Cell Ontology [146]; Cell Line Ontology [147]; The BRENDA Tissue Ontology [148], [149]	There is no agreed standard on how to report cell line or tissue names, and especially for commercial cell lines the names may be producer dependent.

While the Natural Language Processing (NLP) field is working intensively on methods to extract information from academic (or free) texts and to map between terms [142], these methods often struggle with the specificity of biological terms and frequently require manual adjustments. For example, the meaning of a term can change with a single word, such as “not”, “upregulated”, “downregulated”, “increase” or “decrease”: two terms that differ only in such a word will yield a high matching score in the algorithm, even though they may actually describe opposite events. The usage of different terms to describe the same “thing”, or the usage of abbreviations [143], also proves challenging for NLP algorithms and often requires them to be provided with a pre-defined dictionary, which needs to be created (mostly) manually [143].
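The polarity problem described above is easy to reproduce with a purely character-based matcher. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for a term-matching algorithm and adds a crude polarity check based on a small hand-made word list. The lexicon, the threshold and the example terms are assumptions for illustration, not a recommended NLP pipeline.

```python
from difflib import SequenceMatcher

# Small hand-made polarity lexicon; an assumption for illustration only.
POSITIVE = {"upregulated", "increase", "increased"}
NEGATIVE = {"downregulated", "decrease", "decreased", "not"}

def surface_similarity(a, b):
    """Character-level similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def polarity(term):
    """+1, -1 or 0 depending on which polarity cue words occur in the term."""
    tokens = set(term.lower().split())
    score = len(tokens & POSITIVE) - len(tokens & NEGATIVE)
    return (score > 0) - (score < 0)

def is_safe_match(a, b, threshold=0.85):
    """Accept a match only if the strings are similar and their polarity agrees."""
    return surface_similarity(a, b) >= threshold and polarity(a) == polarity(b)

term_a = "IL6 expression increased"
term_b = "IL6 expression decreased"
similarity = surface_similarity(term_a, term_b)  # high despite opposite meaning
accepted = is_safe_match(term_a, term_b)         # rejected by the polarity check
```

A word list of this kind is exactly the sort of pre-defined dictionary that has to be curated manually, which is why such checks do not scale without community-maintained vocabularies.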

Concept mapping and data linking challenges

Going hand in hand with the challenges outlined in 5.1 and 5.2, another difficulty to overcome in order to make Big Data and KGs suitable for chemical safety assessment and drug development is that data is currently not at the forefront in all sub-disciplines of the life sciences. From a clinical point of view, the patient is at the centre; the re-use of the resulting data is at best an afterthought and may amount to a case report next to free-text entries in the medical record [150]. While disease progressions, treatment responses or comorbidities may be outlined and reported during clinical studies, this is often done in writing, which is traditionally challenging to process computationally, particularly in combination with the previously outlined challenges. The same can be said for academic research, where experimental outcomes are often reported only in a publication and, even if the data is provided, many details are lost in translation, as outlined in 5.2. While the NLP field is working on methods to extract valuable medical information from text [142], [150], multiple drawbacks and challenges remain, and no consensus has yet been reached on which method works most reliably [150]. For now, this puts the responsibility back on the data generators. Every sub-discipline needs to realise the value of data, to understand that humans cannot process the amount of data available, and to recognise that data from different sub-disciplines will provide insight into the bigger picture only in combination. While data provision and reporting are becoming more common, they need to become more widespread, together with a general understanding of computational methods and data management among researchers in the field, so that individual researchers can make informed decisions on how and what to report. However, we expect this to change over time, as computers play a large role in the daily lives of current and next generation researchers and as more computational methods are taught during their education.

Unavailability of negative data

In order to learn the most from available data, not only positive results but also negative ones should be reported and integrated into the knowledge base. Commonly, such negative results are not reported and are therefore not available to the wider community, resulting in the loss of valuable information. Researchers should therefore adopt a more data centred approach [82], with the goal of reporting everything - from metadata to failed approaches. On the one hand, this makes it possible to learn from negative samples in the data; on the other hand, it prevents other researchers from wasting valuable resources on the same or similar experiments. Many of the previously described KG applications relied on supervised classification tasks [14], [22], [127]. However, the life science domain often struggles with the availability of true negative relationships, since from an experimental point of view they are often not considered worth testing or reporting. For example, Zhang et al. [14] used drug indication pairs as negative data points for the classifier in their adverse outcome prediction problem. However, from a biological point of view, a drug's indication and its adverse outcomes are closely related and may even be dose or situation dependent. This suggests that drug indication pairs and drug adverse outcome pairs are not fundamentally different from a biological point of view, making the former highly unsuitable as substitutes for truly unrelated drug phenotype relationships. Without true negative data points, however, the training and validation of such classifiers remain difficult.
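A common workaround when true negatives are unavailable is to sample unobserved pairs and treat them as presumed negatives. The sketch below shows this on invented toy data; it is not the procedure used in [14]. The sampled pairs are merely unreported, not experimentally confirmed negatives, so some of them may be undiscovered true associations, which is precisely the limitation discussed above.

```python
import random
from itertools import product

def sample_presumed_negatives(positives, drugs, phenotypes, n, seed=0):
    """Draw n drug-phenotype pairs that are absent from the positive set.

    These are unobserved pairs, not experimentally confirmed negatives, so
    some of them may be undiscovered true associations.
    """
    candidates = sorted(set(product(drugs, phenotypes)) - set(positives))
    if n > len(candidates):
        raise ValueError("not enough unobserved pairs to sample from")
    return random.Random(seed).sample(candidates, n)

# Hypothetical toy data.
DRUGS = ["drug:A", "drug:B", "drug:C"]
PHENOTYPES = ["se:nausea", "se:rash"]
POSITIVES = [("drug:A", "se:nausea"), ("drug:B", "se:rash")]

negatives = sample_presumed_negatives(POSITIVES, DRUGS, PHENOTYPES, n=2)
```

If reported negative results were integrated into the KG, the candidate set could be replaced by confirmed negatives and the labelling noise introduced by this sampling step would disappear.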

Summary and outlook

This review provided an overview of the advantages of data modelling and exploration by means of graph databases and KGs in the context of chemical safety assessment and drug design. These processes rely on vast and diverse data sets from many different areas of the life sciences. KGs can significantly improve the integration, re-use, accessibility and quality of such diverse data sets. Examples of successful KG applications for different tasks were provided, such as drug repositioning, drug target prediction, drug-drug interaction prediction, clinical trials and chemical risk assessment. Finally, current challenges that are suggested to hinder KGs from reaching their full potential in drug development and chemical safety assessment were outlined. This review suggests a shift in mentality across the multiple sub-fields of the life sciences towards a data centred approach, in which semantic standards, methods for data creation and availability, and data re-use are central. KGs have found widespread use in the technology industry. However, the data included in those KGs is often less diverse and varies less in quality, and the KG use cases are less varied, than when KGs are applied in the life sciences. Therefore, more research into large scale KGs and their applicability domain needs to be conducted, especially for the life sciences. In conclusion, KGs are emerging as a successful tool for drug and chemical development and their safety assessment. This review suggests that the use of data-driven approaches on top of a KG infrastructure, in combination with a data centred view, can accelerate these processes significantly and solve multiple challenges associated with the compound development process and its safety assessment.

Funding

This study was supported by the Academy of Finland [322761], the EU H2020 NanoSolveIT project [814572] and the European Research Council (ERC) programme, Consolidator project “ARCHIMEDES” [101043848].

CRediT authorship contribution statement

Alisa Pavel: Conceptualization, Methodology, Investigation, Writing – original draft, Writing – review & editing, Visualization. Laura A. Saarimäki: Writing – review & editing, Visualization. Lena Möbus: Writing – review & editing, Supervision. Antonio Federico: Writing – review & editing, Supervision. Angela Serra: Conceptualization, Writing – review & editing, Project administration, Supervision. Dario Greco: Conceptualization, Writing – review & editing, Project administration, Supervision, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
