Andreas Bender, Isidro Cortes-Ciriano.
Abstract
'Artificial Intelligence' (AI) has recently had a profound impact on areas such as image and speech recognition, and this progress has already translated into practical applications. However, in the drug discovery field, such advances remain scarce, and one of the reasons is intrinsic to the data used. In this review, we discuss aspects of, and differences in, data from different domains, namely the image, speech, chemical, and biological domains, the amounts of data available, and how relevant they are to drug discovery. Improvements in the future are needed with respect to our understanding of biological systems, and the subsequent generation of practically relevant data in sufficient quantities, to truly advance the field of AI in drug discovery, to enable the discovery of novel chemistry, with novel modes of action, which shows desirable efficacy and safety in the clinic.
Year: 2021 PMID: 33508423 PMCID: PMC8132984 DOI: 10.1016/j.drudis.2020.11.037
Source DB: PubMed Journal: Drug Discov Today ISSN: 1359-6446 Impact factor: 7.851
Number of data points and/or sizes of data sets available in different domains, namely for images, self-driving cars, astronomy, and in the chemical, biological, and drug discovery domains^a
| Domain | Number of entries | Size of data set (bytes) | Refs |
|---|---|---|---|
| Data sets of images, from autonomous cars and astronomy | |||
| ImageNet | 14 × 10^6 | | |
| Tencent image data set | 20 × 10^6 | | |
| Tesla (video/sensor-derived information) | | 1–450 TB/day/user (assuming 2 TB/day/user and 10^6 users, this means ∼2 EB/day, or ∼1 ZB/year) | |
| Square Kilometre Array (world’s largest radiotelescope) | | ∼100 TB/s (∼100 EB/year) | |
| Data sets in drug discovery | |||
| EMBL: raw data across databases | | 273 PB | |
| REAL database of ‘drug-like molecules’ | 1.2 × 10^9 | | |
| ZINC database release 15 (purchasable compounds) | ∼750 × 10^6 | | |
| ChEMBL: compounds with bioactivity annotations (Release 26) | 16 × 10^6 | 12 GB (Oracle tablespace) | |
| Marketed Drugs (DrugBank v5.1.5) | 13 548 entries (2626 approved small molecules, 1372 approved biologics, 131 nutraceuticals, >6363 experimental drugs) | ||
| DrugMatrix compound with organ-based gene expression data | 627 | ||
| Drugs with DILI annotations | 1036 | ||
| SIDER (drugs with adverse effect annotations) | 1430 drugs (139 756 drug–adverse effect pairs) | ||
| Open Targets Platform (as of April 2020) | 8 462 444 associations spanning 13 818 diseases and 27 700 targets | ||
| Tox21 screening data | ca. 10 000 molecules tested for 72 endpoints (∼50 × 10^6 data points in total) | | |
| Registry of Toxic Effects of Chemical Substances (RTECS) | Data for ‘more than 160 000 chemicals’ | ||
^a TB, terabytes or 10^12 bytes; PB, petabytes or 10^15 bytes; EB, exabytes or 10^18 bytes; ZB, zettabytes or 10^21 bytes.
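The fleet-level estimate in the Tesla row can be checked with a few lines of arithmetic. The following is a minimal sketch using the decimal byte units from the footnote; the function name `fleet_volume` is hypothetical, and the 2 TB/day/user and one million users are the assumptions stated in the table itself.

```python
# Decimal (SI) byte units, as defined in the table footnote.
TB = 10**12
EB = 10**18
ZB = 10**21


def fleet_volume(bytes_per_user_per_day: int, users: int, days_per_year: int = 365):
    """Return (bytes per day, bytes per year) summed over all users."""
    per_day = bytes_per_user_per_day * users
    return per_day, per_day * days_per_year


# Assumptions taken from the table: 2 TB/day/user and 10^6 users.
per_day, per_year = fleet_volume(2 * TB, 10**6)
print(f"{per_day / EB:.1f} EB/day")    # 2.0 EB/day
print(f"{per_year / ZB:.2f} ZB/year")  # 0.73 ZB/year, i.e. on the order of 1 ZB

# For scale: the ChEMBL tablespace quoted in the table is 12 GB.
chembl_bytes = 12 * 10**9
print(f"One fleet-day is about {per_day / chembl_bytes:.1e} times ChEMBL")
```

The point of the comparison is the gap in orders of magnitude rather than the exact figures, which depend entirely on the assumed per-user volume.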
Figure 1. Illustration of the differences between image recognition and classification tasks in the chemical and biological drug discovery domains. When classifying images (and also speech), the model architecture and representation of the object are more integrated than when using chemical and biological data, and labels can be assigned relatively less ambiguously. In the chemical domain, the best representation of an object is generally unknown (different aspects of a chemical are responsible for different types of effect, and some might be related to the functional group, others related to surface properties, etc.), whereas, in the biological domain, it is not clear which type of information provides information related to which endpoint. Common to the chemical and biological domains is that labels depend to a large extent on the set-up of a particular experiment, even if the same thing is measured ‘in principle’.
Comparison of data and representations in the image, speech, and chemistry/biology domains^a,b
| Domain | Representation relevant for objective | Representation comprehensive | Underlying distribution known | Sampling of underlying distribution | Conditionality of data | Quantitative dependence of label on external context |
|---|---|---|---|---|---|---|
| Images | Pixels describe object (but dependent on orientation) | Yes within domain (images contain all information about visual object) | No | Biased but good (large data sets available) | Partial | None (labels can be assigned in binary fashion) |
| Speech | Yes (waveform captures all aspects of speech) | Yes | No | Biased and good (large data sets available) | Partial (context); local and global structure | None (words can be assigned entirely based on waveform) |
| Chess/GO | Yes (locations and functions of pieces are fully defined) | Yes (positions of pieces entirely describe state of system) | Can be calculated in principle, because there is a large but finite set of movements | Can be exhaustively sampled (in principle) | No | N/A |
| Drug discovery: chemistry | Depends on context: which features/representation of compounds is relevant is often unknown | Partially (conformations, protonation states, etc. are frequently unknown) | No (chemical space not known in its entirety; can only be calculated as approximation) | Biased and small (100s; up to 10^6–10^9 out of 10^63) | Partially (e.g., lipophilicity depends on protonation states, etc.) | Depends on context |
| Drug discovery: biology | Which aspect of biology contains information for which endpoint is frequently unknown | No (level of biological type of data generated, temporal, and spatial domain not explored) | Very partial (e.g., amino acid distributions in evolution) | Biased (depends heavily on experimental set-up) | Yes (e.g., gene expression depends on treatment, cell type, etc.) | Very large (biological system is heavily influenced by system, experimental set-up) |
^a Given recent successes of AI in the games Chess and GO, these are included for comparison. It can be seen that chemical, and in particular biological, systems are difficult to describe, given that data can be generated on many different layers (genes, proteins, etc.); representations are not comprehensive; sampling is low; and data depend significantly on the condition of the system, while quantitative aspects abound (e.g., different concentrations of a chemical can lead to entirely different biological responses).
^b The colour scheme indicates in which cases data and representations are expected to cause relatively few problems in computational models (green), intermediate problems (yellow), or large problems (red) due to either high dimensionality, incomplete data, or incomplete definition of the problem in a given representation, or due to other reasons.
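To make the 'biased and small' sampling of chemical space in the chemistry row concrete, the sketch below computes what fraction of the quoted chemical space is covered by data sets of different sizes. The data set sizes and the size of chemical space are the order-of-magnitude figures from the table; the function name `log10_fraction` is hypothetical.

```python
import math

# Order-of-magnitude size of chemical space, as quoted in the table.
CHEMICAL_SPACE = 10**63


def log10_fraction(n_sampled: int, space: int = CHEMICAL_SPACE) -> float:
    """Return log10 of the fraction of `space` covered by `n_sampled` compounds."""
    return math.log10(n_sampled) - math.log10(space)


# Data set sizes mentioned in the table: hundreds, up to 10^6-10^9 compounds.
for n in (10**2, 10**6, 10**9):
    exponent = round(log10_fraction(n))
    print(f"{n:.0e} compounds cover about 10^{exponent} of chemical space")
```

Even the largest listed data set covers a vanishing fraction of the estimated space, which is why the bias of what has been sampled matters at least as much as the raw count.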
Figure 2. The positive predictive values (PPV) of target–adverse event associations against the hit rate or recall (i.e., the fraction of drugs associated with the adverse event also being active at an individual protein target). Activity calls were made based on the ratio of the in vitro bioactivity and the unbound plasma concentration. Target–adverse event pairs with a high PPV tend to have a low hit rate, meaning only a small share of all drugs associated with the adverse event would be picked up by the bioactivity at the target. Alternatively, a high hit rate is associated with a low PPV, indicating a high false positive rate for that target–adverse event combination. Thus, overall, there exists no clear 1:1 relationship between on-target activity and observed adverse events after compound administration. Abbreviations: ADRA1B, α1b adrenergic receptor; ACE, angiotensin-converting enzyme; CHRM1/2/3, muscarinic acetylcholine receptor M1/2/3; PTGS1, cyclooxygenase-1; DRD2, dopamine D2 receptor; FAERS, US Food and Drug Administration Adverse Event Reporting System; HTR2A, serotonin 2a (5-HT2a) receptor; HTR2C, serotonin 2c (5-HT2c) receptor; KCNH2, hERG; SIDER, SIDe Effect Resource.
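The two quantities plotted in Figure 2 can be written down in a few lines. The following is a minimal sketch, not the procedure used for the figure: the drugs, their values, and the exact activity rule (IC50 divided by unbound plasma concentration at or below a cut-off `max_ratio`) are all hypothetical and chosen only to illustrate the definitions.

```python
from dataclasses import dataclass


@dataclass
class Drug:
    name: str
    ic50_um: float            # in vitro bioactivity at the target (µM), made up
    unbound_conc_um: float    # unbound plasma concentration (µM), made up
    has_adverse_event: bool   # drug is associated with the adverse event


def is_active(drug: Drug, max_ratio: float) -> bool:
    """Assumed activity call: IC50 / unbound concentration <= max_ratio."""
    return drug.ic50_um / drug.unbound_conc_um <= max_ratio


def ppv_and_hit_rate(drugs, max_ratio):
    """PPV = TP / (TP + FP); hit rate (recall) = TP / (TP + FN)."""
    tp = sum(is_active(d, max_ratio) and d.has_adverse_event for d in drugs)
    fp = sum(is_active(d, max_ratio) and not d.has_adverse_event for d in drugs)
    fn = sum(not is_active(d, max_ratio) and d.has_adverse_event for d in drugs)
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    hit_rate = tp / (tp + fn) if tp + fn else float("nan")
    return ppv, hit_rate


# Hypothetical drugs for one target-adverse event pair.
drugs = [
    Drug("A", 0.1, 1.0, True),
    Drug("B", 0.5, 1.0, False),
    Drug("C", 5.0, 1.0, True),
    Drug("D", 20.0, 1.0, True),
    Drug("E", 50.0, 1.0, False),
]

for max_ratio in (0.2, 10.0):
    ppv, hit_rate = ppv_and_hit_rate(drugs, max_ratio)
    print(f"max_ratio={max_ratio}: PPV={ppv:.2f}, hit rate={hit_rate:.2f}")
```

With these made-up numbers, the strict cut-off gives a high PPV but picks up only one of the three drugs associated with the adverse event, whereas the lenient cut-off recovers more of them at the cost of a false positive, mirroring the trade-off described in the caption.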
Data available at different stages of drug discovery, and different problem settings^a
| Data availability | | Problem setting | |
|---|---|---|---|
| Proxy data (usually | Efficacy/safety data (usually | Proxy data (usually | Efficacy/safety data (usually |
| Often ‘simple’ readouts (e.g., activity on protein) | Quantitative data (dose, exposure, etc.) | Discovery setting: ‘find me suitable 100s or 1000s out of 1 million’ (e.g., prioritization, screening) | Need to predict for this particular data point (molecule) |
As has been shown before for quantity, the quality of the models used for decision-making has a significant impact on project success. This has profound implications for the use of predictive models in AI-driven drug discovery: models must have sufficient inherent performance, and they must be trained on endpoints that are relevant to the in vivo situation (which is not a given for every endpoint).
Different types of chemical and biological information utilized in drug discovery, along with a brief description of where and why labeling data is nontrivial in each domain
| Data type | Representation | Difficulty/shortcoming | Resulting problem in AI applications |
|---|---|---|---|
| Chemical structure | 1D descriptors, fragments, graphs, pharmacophores, surfaces, etc. | Descriptor choice subjective (not known beforehand/trial and error), etc. | No problem-inherent representation |
| Biological data | |||
| Activity on protein target | Single number (e.g., IC50, Ki/Kd, etc.) | Functional pharmacological effects incompletely characterized | Ligand–protein labels are both incomplete (not available for all combinations) and heterogeneous (stem from different types of endpoint measured) |
| Mode of action | Target, pathway, functional level, etc. | Can be defined on different levels; there is no intrinsic ‘mode of action’ of a compound | Labels heterogeneous (because concept is not properly defined itself; see also ketamine case study in |
| Gene expression data | Up- and downregulation of individual genes (or pathways) | Noisy | High-dimensional and noisy input space (∼20 000 dimensions) |
| Cellular imaging readouts | Images themselves, or explicitly defined features | Dependent on parameters (cell line/system, time point, dose) | Not trivial to infer biological meaning (and select relevant features) |
| Physiological data | |||
| PK data | Concentration over time per tissue type | Difficult to generate (especially for tissues that are difficult to access, e.g., brain, lungs) | Generally too few PK data available (almost none for tissues) |
| Animal endpoints | e.g., clinical parameters, histopathology | Biological variation | |
| Clinical endpoints | Organ-based endpoints (e.g., DILI); adverse events such as in MedDRA, disease annotations such as from ICD-10, etc. | Endpoint definitions are partially overlapping and cannot be assigned clearly (e.g., DILI depends on dose, etc.) | |
^a It can be seen that little of the information in the field can easily be assigned labels, be it from the chemical or biological domain, which makes it difficult to apply AI approaches to such data in drug discovery.
Figure 3. Illustration of a scientific question, or hypothesis, at the basis of data generation. The hypothesis leads to the generation of relevant data for a given question, which are represented in a signal-preserving manner, and which are then analyzed using a method that is able to handle the signal in the data. A method cannot save an unsuitable representation, which cannot remedy irrelevant data, for an ill thought-through question. This principle needs to be at the basis of data generation for making true use of ‘artificial intelligence’ in drug discovery.