Literature DB >> 36051875

On modeling and utilizing chemical compound information with deep learning technologies: A task-oriented approach.

Sangsoo Lim1, Sangseon Lee2, Yinhua Piao3, MinGyu Choi4,5, Dongmin Bang6, Jeonghyeon Gu7, Sun Kim3,7,8,5.   

Abstract

A large number of chemical compounds are available in databases such as PubChem and ZINC. However, the currently known compounds, though numerous, represent only a fraction of the possible compounds, which are collectively known as chemical space. Many compounds in these databases are annotated with properties and assay data that can be used for drug discovery efforts. For this goal, a number of machine learning algorithms have been developed, and recent deep learning technologies can be effectively used to navigate chemical space, especially for unknown chemical compounds, in terms of drug-related tasks. In this article, we survey how deep learning technologies can model and utilize chemical compound information in a task-oriented way by exploiting annotated properties and assay data in chemical compound databases. We first compile the kinds of tasks that machine learning methods aim to accomplish. Then, we survey deep learning technologies to show their modeling power and current applications for accomplishing drug-related tasks. Next, we survey deep learning techniques that address the insufficiency of annotated data for more effective navigation of chemical space. Chemical compound information alone may not be powerful enough for drug-related tasks, so we also survey what other information, such as assay and gene expression data, can be used to improve the prediction power of deep learning models. Finally, we conclude this survey with four important newly developed technologies that are yet to be fully incorporated into computational analysis of chemical information.
© 2022 The Author(s).


Keywords:  Chemical information modeling; Chemical space; Computer-aided drug discovery; Data augmentation; Deep learning

Year:  2022        PMID: 36051875      PMCID: PMC9399946          DOI: 10.1016/j.csbj.2022.07.049

Source DB:  PubMed          Journal:  Comput Struct Biotechnol J        ISSN: 2001-0370            Impact factor:   6.155


Introduction

Chemical space in drug discovery refers to a collection of chemical compounds satisfying a certain set of properties, and definitions of chemical space vary widely depending on the criteria [1], [2], [3], [4]. Out of the theoretically possible drug-like chemical compound space, estimated to be astronomically large [5], the number of known/identified chemical compounds ranges from thousands in DrugBank to tens of millions in PubChem or ZINC, depending on definitions of drug-like chemical space [6], [7]. Before the extensive use of computer-aided drug discovery (CADD), Lipinski’s rule of five (RO5) had long been accepted as the general rule for drug-likeness of a compound [8]. Recent studies reported that drugs such as Atazanavir, Erythromycin, and Sirolimus disobey the RO5, occupying extended or beyond-RO5 space [9], [10]. These examples show the difficulty of defining chemical space in terms of specific criteria for chemical compounds: our current knowledge of the properties and characteristics of chemical compounds is still not enough to define chemical space. One promising alternative is to learn chemical space directly from data. There have recently been remarkable advances in artificial intelligence technologies, deep learning in particular, and these technologies have been successfully used to model properties or characteristics of chemical compounds in the context of “tasks” such as absorption, distribution, metabolism, excretion, or toxicity (ADMET) prediction. Defining chemical space in terms of tasks is a supervised learning problem where tasks are defined quantitatively using assay or clinical data. Because chemical compounds were long difficult to handle as graph representations, common approaches represent chemical compounds with linear notations, e.g., SMILES, or fingerprint-based representations, e.g., MACCS.
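The RO5 mentioned above reduces to a simple rule-count check; below is a minimal Python sketch. The descriptor values are approximate illustrative numbers supplied directly, not computed structures; in practice they would come from a cheminformatics toolkit.

```python
def passes_ro5(mw, logp, h_donors, h_acceptors):
    """Lipinski's rule of five: a compound is considered drug-like
    if it violates no more than one of the four criteria below."""
    violations = sum([
        mw > 500,          # molecular weight over 500 Da
        logp > 5,          # octanol-water partition coefficient over 5
        h_donors > 5,      # more than 5 hydrogen-bond donors
        h_acceptors > 10,  # more than 10 hydrogen-bond acceptors
    ])
    return violations <= 1

# Approximate aspirin-like descriptor values: passes
print(passes_ro5(mw=180.2, logp=1.2, h_donors=1, h_acceptors=4))   # True
# A large macrocycle-like profile (beyond-RO5 space): fails
print(passes_ro5(mw=914.2, logp=6.0, h_donors=3, h_acceptors=13))  # False
```

Note that a single violation is still tolerated, which is why compounds slightly over one threshold remain inside RO5 space.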
Traditional feature-based methods, such as random forests or support vector machines, take these representations as input and identify important features that are effective for performing given tasks on chemical compounds. Deep learning models also use linear or fingerprint representations of compounds as input but learn alternative representations of chemical compounds as embedding vectors (see Section 3) when accomplishing given tasks. One major advantage of embedding vectors is that they can be used to compare compounds more effectively by computing similarity among embedding vectors. This is one of the reasons why recent deep learning models outperform traditional feature-based methods. Another important use of embedding vectors of chemical compounds is to build chemical space in more general settings before addressing downstream tasks. This general representation of chemical space, learned via “pre-training” strategies, can then be specialized for specific tasks. Even with powerful representations of chemical compounds, it is also important to incorporate valuable traditional knowledge such as chemical properties measured by assays; thus, many deep learning-based methods use parallel architectures that learn chemical space while utilizing chemical properties to achieve prediction power for specific tasks. This survey summarizes these new developments in a single article so that the drug research community can reach a better understanding of these technologies and utilize recent computational methods more effectively. We organize the survey as summarized in Fig. 1. Tasks that can be accomplished with chemical information are summarized in Section 2. Recently developed deep learning methods are summarized in terms of technical methods and also tasks in Section 3. Computational methods for creating general representations of chemical space, known as pre-training strategies, are summarized in Section 4.
Section 5 discusses how much improvement can be made when traditional knowledge, such as chemical properties measured by assays, is incorporated into deep learning models. Finally, we discuss four topics important for developing more powerful methods that can accomplish drug discovery tasks more accurately.
Fig. 1

Overview of the present review on building a chemical space using deep learning methods. In Section 2, benchmark tasks on drug discovery are introduced. In Section 3, several deep learning methods are introduced with selected state-of-the-art approaches. In Section 4, discussions on how to build an improved representation space are made in terms of self-supervised, generative, and mixup methods. Finally, in Section 5, features are introduced that can provide additional information other than structural formats such as gene expression or physico-chemical properties.


Tasks: What Can We Do with Chemical Compound Information?

Chemical compound information can be used by computational methods to accomplish tasks such as ADMET prediction. Evaluating the performance of computational methods requires accurately labeled databases with qualitative and quantitative information for a spectrum of chemical properties. Several benchmark databases have been developed for these purposes. MoleculeNet [11] is a representative database used to develop machine learning models on chemical data; it is organized into four categories: physiology, biophysics, physical chemistry, and quantum mechanics. As the database was constructed to help develop molecular machine learning methods, it has been widely used by state-of-the-art deep learning methods as a standard evaluation criterion (see Table S1). Therapeutics Data Commons (TDC) was recently released to focus more on drug discovery tasks for small molecules, peptides, and other biological entities [12]. TDC re-organizes small-molecule benchmark tasks mainly into ADMET categories for more task-oriented computational method development. In this section, we focus on ADMET prediction tasks, a set of important criteria that must be jointly optimized to determine the efficacy and selectivity of drugs, because they are still considered major hurdles in drug discovery [13], [14], [15], [16]. Table 1 summarizes ADMET benchmark datasets widely used in computer-aided drug discovery. Both absorption and toxicity datasets have been primarily used as benchmark data, for several reasons. First, as data labels are experimentally determined, the maturity of a benchmark dataset closely tracks that of the experimental techniques. Second, the scope and amount of data points depend strongly on public availability. Third, tasks such as solubility are chemical properties that can be determined straightforwardly from first-principles knowledge.
Table 1

Chemical tasks in drug discovery. Data imported from [12]. (Binary: binary classification, Reg: regression)

Task         | Dataset                                       | Size    | ML Type    | Reference
Absorption   | Caco-2 (Cell Effective Permeability)          | 910     | Reg        | [17]
             | HIA (Human Intestinal Absorption)             | 578     | Binary     | [18]
             | Pgp (P-glycoprotein) inhibition               | 1,218   | Binary     | [19]
             | Bioavailability                               | 640     | Binary     | [20]
             | Lipophilicity                                 | 4,200   | Reg        | [11]
             | Solubility                                    | 9,982   | Reg        | [21]
             | Hydration Free Energy                         | 642     | Reg        | [11], [22]
             | Subtotal                                      | 16,558  |            |
Distribution | BBBP (Blood–Brain Barrier Permeability)       | 1,975   | Binary     | [11], [23]
             | PPBR (Plasma Protein Binding Rate)            | 1,797   | Reg        | [24]
             | VDss (Volume of Distribution at steady state) | 1,130   | Reg        | [25]
             | Subtotal                                      | 4,678   |            |
Metabolism   | CYP P450 (2C19 Inhibition)                    | 12,665  | Binary     | [26]
             | CYP P450 (2D6 Inhibition)                     | 13,130  | Binary     | [26]
             | CYP P450 (3A4 Inhibition)                     | 12,328  | Binary     | [26]
             | CYP P450 (1A2 Inhibition)                     | 12,579  | Binary     | [26]
             | CYP P450 (2C9 Inhibition)                     | 12,092  | Binary     | [26]
             | CYP2C9 Substrate                              | 666     | Binary     | [27], [28]
             | CYP2D6 Substrate                              | 664     | Binary     | [27], [28]
             | CYP3A4 Substrate                              | 667     | Binary     | [27], [28]
             | Subtotal                                      | 16,877  |            |
Excretion    | Half Life                                     | 667     | Reg        | [29]
             | Clearance (microsome)                         | 1,102   | Reg        | [24], [30]
             | Clearance (hepatocyte)                        | 1,020   | Reg        | [24], [30]
             | Subtotal                                      | 1,592   |            |
Toxicity     | LD50                                          | 7,385   | Reg        | [31]
             | hERG blockers                                 | 648     | Binary     | [32]
             | hERG Central                                  | 306,893 | Binary/Reg | [33]
             | Ames Mutagenicity                             | 7,255   | Binary     | [34]
             | DILI (Drug-Induced Liver Injury)              | 475     | Binary     | [35]
             | Skin reaction                                 | 404     | Binary     | [36]
             | Carcinogens                                   | 278     | Binary     | [28], [37]
             | Tox21                                         | 7,831   | Binary     | [38]
             | ToxCast                                       | 8,576   | Binary     | [39]
             | ClinTox                                       | 1,484   | Binary     | [40]
             | Subtotal                                      | 327,133 |            |
Total        |                                               | 349,036 |            |

Absorption

Drug absorption tasks concern how effectively a drug enters the human biological system. Among the datasets in Table 1, lipophilicity [11], hydration free energy [22], and solubility [21] are widely used. In general, it is recommended to decrease lipophilicity and increase solubility, because higher lipophilicity often leads to a higher rate of metabolism, poor solubility, higher turn-over, and poor absorption [41]. Poor water solubility, in turn, can lead to slow drug absorption, inadequate bioavailability, and even toxicity [42], [43]. Although other datasets such as HIA [18] or Pgp [19] are also well established for investigating gastrointestinal or intestinal absorption [44], [45], they are not as commonly used as the lipophilicity or solubility datasets.

Distribution

Drug distribution tasks deal with how effectively an absorbed drug is delivered to its desired targets. The blood–brain barrier permeability (BBBP) dataset contains binary labels indicating whether a drug penetrates the blood–brain barrier [23]. Because this barrier blocks most foreign molecules, drugs targeting the central nervous system must be able to permeate it [46]. Plasma protein binding rate (PPBR) [47] is a regression task of predicting the binding rates of drugs to plasma proteins such as albumin. In general, more weakly bound drugs traverse more effectively to their sites of action [48]. However, these two tasks are rarely used in CADD because, mechanistically, they are not determined by chemical structure alone: blood–brain barrier penetration and plasma protein binding involve secondary biological mechanisms, namely adsorptive-mediated transcytosis and binding with albumin, respectively. For the BBBP dataset, as listed in Table S1, most recent studies utilized graph neural network (GNN) or Transformer architectures.

Metabolism

Drug metabolism tasks assess whether a drug is efficiently metabolized to show the desired efficacy without adverse side-effects. Predicting whether a drug inhibits or reacts with proteins of the CYP 450 system is a representative task [26], [27], [28]. Because the CYP 450 enzymes play crucial roles in the breakdown of xenobiotics, a drug that inhibits these enzymes reduces metabolic capacity, which can ultimately lead to drug-drug interactions and adverse effects [49], [50], [51]. Drug metabolism datasets have received less attention because modeling an interaction between a CYP protein and a chemical requires the CYP protein structure. Moreover, even if a drug candidate is known to inhibit a specific CYP protein, further biological network analysis is needed to determine whether the drug causes adverse effects, which is still an open problem [52]. Nevertheless, recent studies addressed the CYP datasets by developing deep featurization strategies to overcome the drawbacks of molecular fingerprints [53], [54]. The key improvement in such methods was a multi-task learning framework that leverages structural diversity across tasks when predicting each individual task.

Excretion

Drug excretion tasks concern the rate at which an active drug is removed from the body. The half-life dataset records the measured time for the concentration of a drug in the body to be reduced by half [29], [55]. Drug clearance is defined as the volume of plasma cleared of a drug over a specified time period [24], [30], [56]. Although the pharmacokinetics of a drug is crucial for determining its dosage, excretion datasets are not widely used in CADD because in vivo measurement of half-life or drug clearance is time-consuming and expensive.
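The two excretion quantities are linked by standard first-order pharmacokinetics: t1/2 = ln 2 / k_el and CL = k_el * Vd. The sketch below uses hypothetical parameter values purely for illustration:

```python
import math

def half_life(k_el):
    """Half-life (h) of a drug eliminated with first-order rate constant k_el (1/h)."""
    return math.log(2) / k_el

def clearance(k_el, v_d):
    """Clearance (L/h) = elimination rate constant (1/h) x volume of distribution (L)."""
    return k_el * v_d

k = 0.1    # hypothetical elimination rate constant, 1/h
vd = 40.0  # hypothetical volume of distribution, L
print(round(half_life(k), 2))      # 6.93 h
print(round(clearance(k, vd), 1))  # 4.0 L/h
```

Benchmark labels are measured in vivo rather than computed this way; the formulas only show how the two reported quantities relate.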

Toxicity

Toxicity tasks predict potential toxicities of drugs in humans. Because toxicity is one of the primary causes of compound attrition, early and accurate prediction of toxicity can significantly accelerate drug discovery and boost the likelihood that a compound reaches the market [58]. As toxicity covers an extensive range of biological toxicity, from heart toxicity (hERG) and liver toxicity (DILI) to carcinogenesis, consortium-level efforts to characterize human toxicity experimentally have been launched: Tox21 [38], ToxCast [39], and ClinTox [40]. They contain an extensive amount of data compared to the other datasets: 7,831, 8,576, and 1,484 compounds, respectively. For the Tox21 dataset, as listed in Table S1, many recent studies utilized GNN or Transformer architectures.

Tasks of Generating Novel Compounds

The goal of generative models is to derive previously unknown, synthesizable compounds with desired chemical properties by utilizing prior knowledge from a large-scale chemical database such as ZINC or ChEMBL (Section 3.5). Tasks in generative modeling are to produce lists of chemical compounds suitable for experimental validation [59]. There are benchmark platforms for molecule generation tasks, such as MOSES [60] and GuacaMol [61], which suggest quantitative metrics to assess performance. For example, basic metrics include validity, uniqueness, and diversity, which compare statistics of the chemical distributions of generated and existing compounds. Molecular property statistics, such as the partition coefficient, drug-likeness, and synthetic accessibility, are also used to evaluate performance. Some models report pharmacochemical filter scores (e.g., Glaxo [62], SureChEMBL [63], or PAINS [64]), the ratios of valid molecules without toxic or reactive functional groups to total generated molecules. Recently, from a multi-modality point of view, three-dimensional (3D) molecular design tasks that take 3D inter-atomic distance information into account have also been tackled. Models for these tasks use quantum mechanics datasets, such as QM9 [65], [66], that contain energy-minimal geometries, harmonic frequencies, energies, and so on. Performance on this task is usually assessed by the aforementioned basic metrics and by chemical stability. In Zhavoronkov et al. [67], deriving a potent lead compound for DDR1 kinase inhibition was completed within 46 days by developing a generative deep learning framework that builds a chemical space from the ZINC clean leads dataset [68] of 4,591,276 molecules and then models the properties of both known DDR1 and common kinase inhibitors (references for datasets in Table S1 of [67]). In Merk et al. [69], two million compounds from ChEMBL22 [70] were used to pre-train an LSTM that generates lead compounds optimized as RXR and PPAR agonists using the RXR [71] and PPAR [72] datasets.
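The basic generation metrics above (validity, uniqueness, and novelty with respect to the training set) reduce to simple set arithmetic once a validity check is available. The sketch below assumes a caller-supplied `is_valid` predicate (in practice backed by a SMILES parser such as RDKit's); the strings and the predicate are illustrative only:

```python
def generation_metrics(generated, training_set, is_valid):
    """Compute validity, uniqueness, and novelty ratios for generated molecules.
    `is_valid` is a predicate over a molecule string (in practice, a SMILES parser)."""
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training_set)
    n = len(generated)
    return {
        "validity": len(valid) / n if n else 0.0,        # valid / generated
        "uniqueness": len(unique) / len(valid) if valid else 0.0,  # unique / valid
        "novelty": len(novel) / len(unique) if unique else 0.0,    # unseen / unique
    }

# Toy example with a trivial validity predicate (hypothetical strings)
gen = ["CCO", "CCO", "c1ccccc1", "not-a-smiles"]
train = ["CCO"]
m = generation_metrics(gen, train, is_valid=lambda s: "-" not in s)
print(m)  # validity 0.75, uniqueness 2/3, novelty 0.5
```

Diversity and the property-distribution metrics require pairwise similarity and descriptor calculations, which depend on a cheminformatics toolkit and are omitted here.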

Bioactivity and Other Benchmark Tasks

Using diverse datasets from different perspectives, we can evaluate the generalizability of ML models. Accordingly, there are benchmarks other than ADMET that are closely related to drug discovery. Examples of such bioactivity datasets are BACE, SIDER, MUV, and HIV. The MUV dataset [11], [73] is a subset of PubChem BioAssay [74] consisting of 17 target tasks over 90 thousand chemicals; its aim is to validate virtual screening methods. The HIV dataset [11], [75] was created by the Drug Therapeutics Program (DTP) AIDS Antiviral Screen to test over 40 k molecules for their potential to inhibit HIV replication. This dataset is widely used for recently developed GNN and Transformer models. The BACE dataset [11] is a collection of 1,522 compounds with binding results for human beta-secretase 1. Another major stream of benchmark tasks is quantum mechanical property prediction. Since molecular properties and chemical reactions are determined by electron configurations and their changes, predicting quantum mechanical properties that describe electronic states is important for drug discovery tasks. Computational chemistry tools such as AMBER [76], Avogadro [77], or CHARMM [78] can provide reliable approximations for modeling molecules, but they require a large amount of computation time, which is not suitable for screening many drug candidates. The QM7/7b [79], QM8, and QM9 [65] tasks involve predicting quantum mechanical properties of molecules from their three-dimensional structures. The most representative task, QM9, utilizes 134 k stable small organic molecules made up of C, H, O, N, and F, each possessing up to nine heavy atoms (C, O, N, F). QM9 provides the most stable geometry of each molecule and 12 quantum mechanical properties of the corresponding conformer, such as harmonic frequencies, dipole moments, and polarizabilities.
There also exist other types of quantum mechanical tasks, including the ANI-1 (Accurate NeurAl networK engINe for Molecular Energies) task for potential energy surface prediction [80] and MD17 (Molecular Dynamics 17) for predicting energy-conserving forces [81], [82]. These tasks can be helpful for investigating stable compound conformers, which are important for docking with target proteins [83]. Many computational tools have been developed for lead optimization [84], hit discovery [85], and docking simulation [86]; however, these tools do not use deep learning technologies, and we refer readers to the major survey papers. For selected ADMET benchmark datasets, the performance of the selected deep learning methods, collecting results from the literature, is summarized in Table 2 and the Supplementary Tables [233]. The details of the selected methods are discussed in Section 3.
Table 2

Chemical tasks in toxicity. We report ROC-AUC scores for the Tox21 dataset. In the “Data” column, ‘S’ refers to SMILES string, ‘G’ to molecular graph, ‘F’ to molecular fingerprint, and ‘P’ to molecular properties. In the “Model” column, ‘M’ indicates results reported from MoleculeNet; ‘T’ refers to transformer-based methods, ‘G’ to graph-based methods, ‘R’ to RNN-based methods, ‘C’ to CNN-based methods, and ‘S’ to shallow embedding methods.

Type                     | Name                | Performance | Data | Model | Year | Ref
Machine Learning in [11] | Logistic Regression | 0.781       | S    | M     | 2018 | [11]
                         | IRVa                | 0.796       | S    | M     | 2018 | [11]
                         | XGBoost             | 0.815       | S    | M     | 2018 | [11]
Deep Learning in [11]    | Weave               | 0.807       | G    | M     | 2018 | [11]
                         | TextCNN             | 0.838       | S    | M     | 2018 | [11]
                         | GraphConv           | 0.850       | G    | M     | 2018 | [11]
Deep Learning            | ChemBERTa           | 0.728       | S    | T     | 2020 | [95]
                         | MICRO-GRAPH         | 0.770       | G    | T     | 2020 | [96]
                         | MoCL                | 0.780       | G    | G     | 2021 | [97]
                         | MolCLR              | 0.789       | G    | T     | 2021 | [98]
                         | SMILES2Vec          | 0.810       | S    | RC    | 2018 | [99]
                         | KCL                 | 0.813       | G    | T     | 2021 | [100]
                         | Transformer-CNN     | 0.820       | S    | CT    | 2020 | [101]
                         | DMPNN               | 0.850       | G    | G     | 2019 | [102]
                         | PotentialNet        | 0.857       | G    | G     | 2018 | [103]
                         | FraGAT              | 0.860       | G    | G     | 2021 | [104]
                         | TrimNet             | 0.860       | G    | G     | 2021 | [105]
                         | Mol2Context-Vec     | 0.860       | FP   | R     | 2020 | [106]
                         | CMPNN               | 0.860       | G    | G     | 2020 | [107]
                         | MPAD                | 0.860       | SG   | SG    | 2020 | [108]
                         | GAP                 | 0.880       | G    | G     | 2019 | [109]
                         | FP2Vec              | 0.880       | F    | CT    | 2019 | [110]
                         | SA-MTL              | 0.900       | SP   | CT    | 2021 | [111]
                         | TOP                 | 0.950       | SP   | R     | 2020 | [112]

a: Influence Relevance Vector


Deep Learning Technologies: How Well Can We Accomplish the Tasks with Chemical Information?

Deep learning models take chemical information in various formats. First, string-format representations include SMILES [87], SMARTS [88], and SELFIES [89]. The composition of substructures such as functional groups is represented as chemical fingerprints, binary vectors based on the existence of specific chemical structural features (e.g., the number of aromatic rings). According to a recent review [90], molecular fingerprints can be divided into three categories: substructure-key-based (MACCS, PubChem, BCI, and TGD), topological or path-based (Daylight or OpenEye’s Tree), and circular fingerprints (Molprint2D, ECFP, FCFP). Among the many fingerprints, three are mainly used in machine learning: PubChem [79], Morgan [91], and MACCS keys [92]. Since compounds consist of atoms and covalent bonds, chemical graph representations have recently been utilized as input to graph deep learning methods (see Section 3.4). Once the information of a compound is provided, machine learning models identify important features of the compound in the context of the tasks to be accomplished. Often multiple features need to be combined to model compound information in task-specific ways. Traditional feature-based machine learning models have had limited success in capturing complex relationships among multiple features. Deep learning methods have the ability to learn such complex relationships directly from data, although many deep learning models are criticized as black-box models; thus, there have recently been many successful examples of deep learning models accomplishing drug discovery tasks more accurately. Attention-based models, a recent development in deep learning technologies, can be used to mitigate the black-box nature of deep learning models. For example, in toxicity prediction tasks, the presence or absence of a toxicophore in a chemical compound is important for its toxicity label [93], [94].
Thus, Convolutional Neural Network (CNN)-based models focus on learning local patterns for toxicity tasks, while models based on GNNs or transformers exploit spatial information to learn toxicophore-related features. In the case of solubility prediction, on the other hand, the dipole moment, which reflects the non-uniform distribution of positive and negative charges in a molecule, is one of the important factors; thus, multiple deep learning methods are being tried in various ways to learn the entire structural representation of a molecule. We summarize the performance of various ML and DL methods on selected benchmark tasks in Table 2 and the Supplementary Tables, and which deep learning methods are used for which tasks in Supplementary Table S1. Different architectures capture different views of chemical compounds: (1) a sequence view and (2) a graph view. When a chemical is considered as a sequence, CNNs and Recurrent Neural Networks (RNNs) can capture the local sequence patterns of chemical strings, and the Transformer considers all pairwise interactions between chemical string elements to embed valid chemical representations. On the other hand, GNNs are well-suited architectures for learning molecules in the graph view, using the a priori topology of the molecular graph to transfer information between nodes and summarize a graph-level representation of the molecule. Moreover, to explore more unknown chemical substances with effective representations, Reinforcement Learning (RL) navigates the huge chemical space and generates new representations by learning effective search strategies that reflect the properties of chemical substances in a specific task. In this section, we review each deep learning technology in the context of drug discovery tasks.
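The fingerprint representations above are typically compared with the Tanimoto (Jaccard) coefficient; a minimal pure-Python sketch over fingerprints given as sets of on-bit indices (the bit indices below are hypothetical, and real fingerprints would be computed by a toolkit such as RDKit):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of
    on-bit indices: |A intersect B| / |A union B|."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 1.0  # convention: two all-zero fingerprints are identical
    return len(a & b) / len(a | b)

# Hypothetical on-bits of two compounds' fingerprints
fp1 = {3, 17, 42, 128, 900}
fp2 = {3, 17, 99, 128}
print(tanimoto(fp1, fp2))  # 3 shared bits / 6 total bits = 0.5
```

The same comparison applies to learned embedding vectors, with cosine or Euclidean similarity taking the place of the Tanimoto coefficient.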

Convolutional Neural Networks

A Convolutional Neural Network (CNN) captures local information within a specific window of data and applies average or maximum pooling to produce a reduced representation of the feature map. This strategy naturally regularizes conventional artificial neural networks (ANNs) and additionally provides the ability to learn hierarchical patterns in the data. CNNs are widely used in image recognition [113], [114], [115], [116], [117] due to their excellent ability to learn important features from images, which increases learning efficiency over ANNs. For molecule representation learning with CNNs, the input molecule is considered as a SMILES-like string. A molecule can be represented by a matrix X ∈ R^(n×d), where n denotes the number of elements in the SMILES string and d denotes the dimension of each element. To learn sequential patterns in SMILES, most CNN-based approaches use 1D CNNs, unlike the 2D CNNs used for learning 2D patterns from image data. Specifically, a 1D CNN slides convolutional filters over X to learn local patterns of the SMILES via the convolution operation and extracts effective representations via the pooling operation. CNNs have been widely used in drug discovery [118], [119], [120], [121], [122], and used to discover patterns related to chemical properties [110], [118]. As CNN kernels are designed to capture localized patterns in SMILES input strings and aggregate them into the final prediction, CNNs are favored for applications where substructures contribute to molecule-level properties such as solubility, hydration free energy, and lipophilicity. CNN-based methods take input compounds as SMILES, some in combination with RNNs or Transformers, to tackle absorption and toxicity benchmark tasks (Table S1).
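The 1D convolution-and-pooling step described above can be sketched in a few lines of pure Python. A character-level one-hot encoding stands in for the n×d input matrix X, and the filter weights are arbitrary illustrative values, not a trained model:

```python
def conv1d_maxpool(X, W):
    """Slide a 1D filter W (width k, dim d) over a sequence X (n rows of dim d),
    take the dot product at each position, and max-pool the feature map."""
    k, d = len(W), len(W[0])
    feats = []
    for i in range(len(X) - k + 1):  # each valid window position
        feats.append(sum(X[i + j][c] * W[j][c] for j in range(k) for c in range(d)))
    return max(feats)  # max pooling over positions

# Toy one-hot encoding of the SMILES "CCO" over the vocabulary [C, O]
X = [[1, 0], [1, 0], [0, 1]]
# Illustrative width-2 filter that responds most to a C followed by an O
W = [[1, 0], [0, 1]]
print(conv1d_maxpool(X, W))  # window scores: 1 for (C,C), 2 for (C,O); pooled max = 2
```

A real model would use many filters, nonlinearities, and stacked layers, but the windowed dot product followed by pooling is the core operation.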
A recently developed method, SA-MTL [123], achieved high performance on the BBBP and Tox21 datasets, with an Area Under the Receiver Operating Characteristic curve (AUC-ROC) of 0.950 and 0.900, respectively, outperforming other methods such as SMILES Transformer [124] (AUC-ROC: 0.802), Transformer-CNN [101] (AUC-ROC: 0.820), and BiLSTM-SA [125] (AUC-ROC: 0.842). The authors demonstrated in an ablation study that the performance margin of 0.102 was contributed mostly by the self-attention layer. They also suggested using a max-pooling layer rather than a discrete output layer in a simple CNN model for highly imbalanced datasets. FP2VEC [110] achieved an AUC-ROC of 0.880 on the Tox21 dataset by employing a multi-task learning framework. ConvS2S [126] improved performance on various datasets, including the solubility, BBBP, and HIV datasets, by suggesting a SMILES augmentation scheme.

Recurrent Neural Networks

A Recurrent Neural Network (RNN) is useful for learning relational data or capturing sequential/temporal information because the output of the previous state is fed into the current state. Similar to CNNs, RNNs have also been widely used to investigate sequentially formatted molecular representations (e.g., SMILES) [106], [127]. Compared to CNNs (see Section 3.1), RNNs and their variants (e.g., LSTMs or GRUs) can capture long-range relationships among chemical elements in SMILES due to their innate recurrent memory mechanism. However, RNNs are not well suited to capturing localized patterns such as functional motifs; therefore, RNNs and CNNs are often used together to complement each other [99], [128], [129] in chemical domains. Depending on how the SMILES-like representation is handled, existing works can be grouped into three categories: atom-level sentences, substructure-level sentences, and SMILES augmentation. A SMILES string can naturally be considered a sentence of atoms with auxiliary symbols (e.g., double bond, branch, or ring). Atom-level sentence approaches [99], [127] consider each symbol as an input feature of the model. Though this approach can directly use RNN or CNN architectures, it overlooks the fact that multiple atoms and bonds form one functional group. To overcome this limitation, substructure-level sentence approaches [106], [110] consider a compound as a sentence of its substructures, which can be obtained using any chemical fingerprint [91], [130] or fragmentation algorithm [131]. A SMILES string is just one of many possible views of a chemical compound, and it is possible to use multiple SMILES representations of a compound, called SMILES augmentation [35], [128], [129], [132]. RNN-based methods have been widely used for tasks in the absorption category and for the Tox21 dataset (Table S1).
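The atom-level sentence view amounts to tokenizing a SMILES string into atoms and auxiliary symbols. A simplified regex tokenizer, covering only bracket atoms, two-letter halogens, organic-subset atoms, ring digits, and bond/branch symbols (an illustrative sketch, not the full SMILES grammar):

```python
import re

# Simplified SMILES tokenizer: bracket atoms, Br/Cl, organic-subset atoms,
# ring-closure digits, and bond/branch symbols. Not a full SMILES grammar.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFIbcnops]|\d|[()=#+\-/\\@%.]")

def tokenize(smiles):
    """Split a SMILES string into atom-level tokens."""
    return SMILES_TOKEN.findall(smiles)

print(tokenize("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin, one token per symbol
print(tokenize("Clc1ccccc1"))             # 'Cl' is kept as a single token
```

Each token would then be mapped to an embedding and fed to the RNN (or CNN) as one step of the "sentence".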
TOP [112] achieved excellent performance in toxicity prediction on the Tox21 dataset (mean AUC-ROC: 0.950) by integrating a shallow representation of SMILES into a biGRU in combination with several molecular descriptors (logP, MW, and TPSA). By incorporating these physicochemical properties, TOP obtained a 0.195 performance gain in terms of AUC-ROC. Li et al. [128] achieved performance comparable to existing methods [99], [129] on the solubility task. Meanwhile, Mol2Context-vec [106] outperformed other RNN-based methods on most benchmark tasks (solubility, lipophilicity, BBBP, and BACE). The authors suggest that learning molecular descriptors such as logP or TPSA solely from SMILES is difficult; thus, additionally providing such features contributed to the performance improvement.

Transformer

The Transformer performs sequence-to-sequence translation tasks in an encoder-decoder framework. The self-attention mechanism [133] is the core component of the transformer architecture; it uses all token pairs to encode contextual information and learn global representations of sequences. Because of this modeling power, the Transformer has been successfully used in the field of natural language processing [134], [135]. In the computer vision domain, vision Transformers [136], [137] achieved outstanding performance on various machine vision problems. Following the great success of the transformer in computer vision and natural language processing, several transformer-based models have been proposed for efficient chemical representations. Leveraging the capability of the transformer as an encoder, it is usually pre-trained on massive unlabeled chemical compounds, either in the form of SMILES or molecular graphs, which leads to outstanding performance in downstream tasks such as absorption, distribution, and toxicity prediction [101], [123], [138], [139], [140]. The crucial point of a chemical transformer is to fully exploit atom interactions and chemical structure information through the self-attention mechanism. SMILES-BERT [139] and ChemBERTa [95] embedded chemical SMILES based on the transformer or BERT (Bidirectional Encoder Representations from Transformers) [141], pre-training a semi-supervised model on unlabeled large-scale data where long-range atom-level interactions can be learned. Task-specific models obtained by fine-tuning the pre-trained model with additional data improved prediction performance for a number of downstream tasks. For example, SMILES-BERT improved LogP prediction performance by about 2% over existing SOTA Seq2Seq-based methods, indicating that it better utilizes unsupervised information through the Masked SMILES Recovery task, and achieved more than 5% and 8% improvements on the PM2 and PCBA tasks, respectively.
Transformer-CNN [101] and SA-MTL [123] combined CNN and transformer components to capture localized chemical substructures and learn interactions between substructures. SA-MTL achieved an AUC-ROC of 0.900 on toxicity tasks, higher than existing deep learning models in toxicity prediction, and also achieved the highest performance on the distribution task (e.g., the BBBP dataset) and other tasks (e.g., HIV, SIDER). Transformers can learn chemical information in the form of molecular graphs as well as SMILES. MolAT [138] augmented the self-attention between atoms with chemical structural information: (1) adjacency on the molecular graph and (2) inter-atomic distances. MolAT outperformed SMILES-based models in predicting various molecular properties, such as water solubility, BBBP, and metabolic stability. ST-KD [142] proposed an end-to-end SMILES transformer trained by knowledge distillation from a graph transformer, without pre-training. Knowledge distillation transfers knowledge from a teacher model to a student model. In ST-KD, the teacher model (graph transformer) is trained first, and then the output of the graph transformer and the real labels of the data are used to train the student model (SMILES transformer). The student model can learn structural information of molecules because the distilled hidden representations and attention weights focus on the information learned by the teacher from the molecular graphs. ST-KD showed outstanding performance on QM datasets against graph-based and SMILES-based models, demonstrating that efficient chemical representations learned by a knowledge-distilled transformer can capture more global information than graph-based representations, and that such global information is more appropriate for QM datasets. Meanwhile, on the FreeSolv and HIV datasets, ST-KD showed results competitive with graph-based models, indicating that these tasks are more likely to focus on local graph structures.

Graph Neural Networks

Graph Neural Networks (GNNs) are well suited for processing graph data. GNNs can be used to predict relationships among members of social networks [143] and to infer biomarkers from biomedical networks [144], [145] (e.g., protein–protein networks). Message passing, the core operation in GNNs, learns feature information propagated on graphs. Specifically, a GNN aggregates information from the neighboring nodes of node v and uses that information to update the representation of node v. To represent a graph G, a readout function then summarizes all updated node representations into a 1D vector representing the graph. Since chemical compounds are complex 3D structures of atoms and bonds, it is natural to represent them as graphs. To represent chemical compounds with GNNs, popular approaches [102], [107], [146], [147] design the message passing algorithm to learn node/edge representations and aggregate them into a molecule representation. However, learning molecule representations this way often only summarizes local proximity information. Recent studies [104], [148], [149], [150] have begun to use message passing non-locally. Some works [148], [149] addressed this problem by message passing on r-radius subgraphs to learn a more global representation of the molecule graph. Others [104], [150] explicitly extract knowledge-guided subgraphs from the molecules and build representations of them, indicating that domain knowledge can be used to reflect global molecular properties for better graph-level representations on chemical property prediction tasks. Depending on the view of chemical structures, the proposed GNN-based approaches can be divided into three categories: atom/bond-level, subgraph-level, and graph-level. Given a molecular graph, atom/bond-level GNNs [102], [107], [146], [147] aggregate information on a target atom from adjacent atoms and bonds.
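The aggregate–update–readout loop described above can be sketched as follows. This is a minimal NumPy sketch with mean aggregation and fixed toy weights (real models learn the weights and use richer update and readout functions); it is not a specific published architecture:

```python
import numpy as np

def message_pass(A, H, W):
    """One round of neighborhood aggregation (mean) plus update.

    A: (n, n) adjacency matrix of the molecular graph (atoms = nodes).
    H: (n, d) current atom features; W: (d, d) learnable update weights.
    """
    deg = A.sum(axis=1, keepdims=True).clip(min=1)
    msgs = (A @ H) / deg                    # aggregate neighbor features
    return np.tanh((H + msgs) @ W)          # update each node representation

def readout(H):
    """Summarize all node vectors into a single 1D graph embedding."""
    return H.mean(axis=0)

# toy 3-atom chain (e.g. C-C-O): edges 0-1 and 1-2
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)
H = np.eye(3)                               # one-hot initial atom features
W = np.full((3, 3), 0.1)
emb = readout(message_pass(A, message_pass(A, H, W), W))
print(emb.shape)  # (3,)
```

After two rounds of message passing, each atom's vector contains information from atoms up to two bonds away, illustrating why stacking more rounds widens the receptive field while the readout keeps the final representation a fixed-size vector.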
MPNN [147] used long-range interactions on gated graph neural networks [151] for molecule prediction tasks on the QM9 dataset and outperformed strong existing baselines without explicit feature engineering. In follow-up studies, CMPNN [107] used a node-edge interaction module through which information can pass between nodes and edges. CMPNN guided the model to learn topological relationships among elements in molecules and outperformed baseline methods in absorption, distribution, and toxicity tasks. In the absorption task, CMPNN achieved Root-Mean-Squared Errors (RMSEs) of 0.23 and 0.82 on the ESOL and FreeSolv datasets, respectively. In the distribution task, CMPNN achieved the best AUC-ROC of 0.963 on the BBBP dataset, and in the toxicity task it achieved AUC-ROCs of 0.856 and 0.933 on the Tox21 and ClinTox datasets. To learn better atom representations, there are also methods that improve GNNs in various ways: Gilmer et al. [147], Feinberg et al. [103], and Yang et al. [102] used Gated Recurrent Units (GRUs) to improve long-range message propagation over the molecule graph. Geometric distance-based methods [152], [153] successfully captured chemical properties in the QM9 and MD17 datasets. Attention mechanism-based methods [105], [154] focused on element-level representation learning to obtain better molecule representations; they outperformed the baselines in ADT tasks and produced interpretable visualizations for these tasks. MT-PotentialNet [54] not only propagated messages differently depending on edge types but also learned a unified featurizer for multiple tasks using multi-task learning. Empirical results showed unprecedented accuracy in ADMET property prediction tasks and revealed that PotentialNet with multi-task learning on the ADMET dataset not only interpolated but also extrapolated to new chemical space. Subgraph-level GNNs can overcome the limitation of atom/bond-level GNNs by explicitly utilizing subgraphs extracted from the molecular structure.
A subgraph can be conveniently defined as an r-radius subgraph [148], [149] or as a domain-knowledge-guided functional group [104], [150]. FraGAT [104] fragmented the graph at the level of bonds: it cut all acyclic single bonds to make basic fragments, then iteratively connected them into fragment pairs, each of which can regenerate the original molecule after one bond-ligation reaction. FraGAT conducted experiments on 14 benchmark datasets covering absorption, distribution, and toxicity tasks. FraGAT outperformed the baselines with RMSEs of 0.48 and 0.54 on the ESOL and FreeSolv datasets in the absorption task, and achieved an AUC-ROC of 0.933 on the BBBP dataset in the distribution task. Its AUC-ROCs of 0.863 and 0.969 on the Tox21 and ClinTox datasets in the toxicity task also outperformed other methods. Experimental results showed that FraGAT achieves state-of-the-art predictive performance in most cases, especially in absorption tasks (e.g., FreeSolv) and toxicity tasks, compared to AttentiveFP [155] with an RMSE of 0.736 on FreeSolv and 94% AUC-ROC on the ClinTox dataset. These results demonstrate the effectiveness of the three-level hierarchical structural information of molecules used in FraGAT. Similar to subgraph-level methods, graph-level GNNs aim to model the global structure of chemicals. For example, MGCN [156] achieved the best performance in predicting 11 out of 13 properties on the QM9 dataset because it explicitly used hierarchical multi-level connection layers (point-wise, pair-wise, triple-wise, etc.) to encode global interactions in the chemical graph.

Reinforcement Learning

Reinforcement learning (RL) is a computational framework in which trained agents or machine learning models make a sequence of decisions in an environment to achieve a specific goal. When the agent performs an action in its current state, the environment gives the agent a computed reward for that action. RL learns policies to make the optimal choices that maximize the total reward by exploring possible states and actions in a trial-and-error manner. Because of this learning framework, RL is widely used in various domains such as the game of Go [157], autonomous driving [158], and protein structure prediction [159]. Using an RNN as the agent architecture, Olivecrona et al. [160] first adopted the RL framework for generating SMILES strings of molecules that are valid and bind to the dopamine receptor type 2 (DRD2). Starting from a single-atom SMILES string, the agent recursively determines the element and the position of the atom to be attached; the reward is based on whether the current molecular structure is active against DRD2, is chemically valid, and avoids sulfur. Iterative additions of atoms by the agent expand the scope of the explored chemical space, while the reward steers the agent away from inactive or invalid chemical space. Ståhl et al. [161] improved the RL framework by incorporating substructural information and chemical properties such as Tanimoto structural similarity [162] and Levenshtein string distance [163]. However, due to the inherent linearity of SMILES, it is difficult to satisfy the Markov decision process (MDP) assumption in SMILES-based RL approaches. By defining a state in the form of a graph, GCPN [164] overcame this invalid-MDP problem; because graphs better capture molecular topology, chemical rules can be better reflected in the transition dynamics. Although GCPN performed better than the SMILES-based RL models, its exploration of chemical space is limited by a predefined set of scaffold subgraphs.
By defining atom/bond-level actions, MolDQN [165] can explore the vast, unrestricted chemical space without scaffold subgraphs. In general, because the reward must be calculated for arbitrary molecules that the model generates, simple molecular properties such as logP and QED are often used. The recently published MoleGuLAR [166] used a multi-objective scheme for generating drug-like molecules with high binding affinity to novel targets along with a desired logP: a mean binding affinity of −6.76 kcal/mol, a mean logP of 2.9, and a mean QED of 0.42 when targeting a logP of 2.5 and a QED of 1, respectively. The authors reported that switching reward functions, rather than summing rewards, improved optimization quality because an alternating reward moves the model into a better local chemical space where one property is optimal when optimization of the conflicting property begins. REACTOR [167] generated on average 77 actives for DRD2 objectives while following chemical reactions to generate 'synthesizable' derivatives; it outperformed previous methods like MolDQN [165] and JT-VAE [168], which added an additional synthesizability term as a negative reward and generated 9.667 and 4.0 actives, respectively.
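The state–action–reward loop underlying these molecule-generation agents can be illustrated with a deliberately simplified sketch. Here the "molecule" is just a growing atom string, the reward function is a toy stand-in for a property oracle (a real agent would score activity, validity, logP, or QED), and the policy is one-step greedy with optional epsilon-exploration rather than a learned network; all names and numbers are illustrative:

```python
import random

def reward(smiles):
    """Toy stand-in for a property oracle: reward carbon-rich strings,
    penalize sulfur (echoing the avoid-sulfur reward in the text)."""
    if "S" in smiles:
        return -1.0
    return smiles.count("C") * 0.1

def rollout(n_steps=5, epsilon=0.1, seed=0):
    """Grow an atom string one symbol at a time.
    The state is the partial string; actions append one atom symbol.
    With probability epsilon the agent explores a random action."""
    random.seed(seed)
    actions = ["C", "O", "N", "S"]
    state = "C"
    for _ in range(n_steps):
        if random.random() < epsilon:            # explore
            a = random.choice(actions)
        else:                                    # exploit: best one-step reward
            a = max(actions, key=lambda a: reward(state + a))
        state += a
    return state, reward(state)

# purely greedy rollout (epsilon=0) for a reproducible demonstration
mol, r = rollout(epsilon=0.0)
print(mol, round(r, 2))  # CCCCCC 0.6
```

The essential structure carries over to the real systems: MolDQN's atom/bond-level actions replace the four toy symbols, and the scalar reward is replaced by property predictors or docking scores evaluated on arbitrary generated molecules.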

Data Augmentation: How to Extend Our Knowledge on Chemical Space?

There are a number of databases that collect chemical compounds with various experimental results. Two main databases in drug discovery are ChEMBL and ZINC. ChEMBL [24] is a database of molecules with bioactivity data; it contains information on 2.1 million compounds in the latest update. ZINC [169] is a database of commercially available molecules; it contains information on 1.3 billion compounds in v20. Although the number of compounds in existing databases seems large, they cover only a tiny fraction of the whole chemical space. Thus, we need more powerful computational techniques to explore chemical space. There are extensive reviews on building a chemical space by using a corpus of unlabeled chemical data for pre-training deep learning architectures [170], [171], [172]. The main point of these deep learning-based methods for chemical space exploration is embedding, in which chemical compounds are represented in the alternative form of embedding vectors. Embedding is the mapping of high-dimensional feature vectors from the raw data space into a relatively low-dimensional space. To learn complex interactions between the raw features, deep learning models embed data using various architectures such as CNNs, RNNs, or GNNs. Because they learn latent characteristics of the data, embedding spaces have two advantages: (1) it is easier to compute similarities between data points or to identify common features that are difficult to capture in the raw data space; (2) the data is transformed into vectors regardless of its original format, allowing machine learning and deep learning models to utilize the vector representations. A question arises here: how can we make a more effective embedding space for given chemical tasks? More effective chemical embeddings should have good generalizability, allowing the structural properties of compounds to be tailored to given benchmark tasks [173], [174].
In particular, most benchmark chemical datasets have a small number of labeled samples (e.g., 475 and 642 drugs in the DILI and hydration free energy datasets, respectively), resulting in insufficient structural diversity from the labeled samples alone. Thus, deep learning models employ various self-supervised learning and generative strategies in their frameworks to increase structural diversity. In other words, deep learning acts as a search algorithm that explores from the raw chemical space to an embedding space with desired properties. The purpose of the search algorithm is to move from the start state to the goal state through intermediate states via transitions, where the raw and desired chemical spaces are the start and goal states, respectively. Here, we define a 'transition' as an optimization step driven by the objective function of a deep learning model. A major obstacle to making efficient transitions is the lack of labeled data (Fig. 2 (a)). Various attempts have been made to address this problem, and this section focuses on data augmentation. Data augmentation provides additional data to deep learning models to help guide the search. If data becomes sufficient after augmentation, the deep learning model can better reach the goal state because the objective function can consider various aspects of the current embedding space. The main issue is how to provide more data when the labeled data is not enough. To overcome the data insufficiency issue, there are three widely used approaches: (1) self-supervised learning, (2) generative learning, and (3) mixup.
Fig. 2

Approaches to address lack of data.

All three methods are similar in that they provide additional data to the deep learning model, but each has its own characteristics. (1) Self-supervised learning (SSL), as the name suggests, generates labels from the data itself, so it can take advantage of the many unlabeled chemical compounds. Recently, SSL has been in the spotlight in various fields such as computer vision, natural language processing, and graph learning. It is a good technique for fine-tuning with small labeled data after constructing a uniform and distinguishable embedding space from a large number of unlabeled data (Fig. 2 (b)). (2) Generative learning creates new chemicals with properties similar to known chemicals (Fig. 2 (c)). Generative models such as the generative adversarial network (GAN) and the variational autoencoder (VAE) can be used, and data in various formats such as SMILES and graphs can be generated and used for learning downstream tasks. (3) Mixup is a technique, mainly used in supervised learning, that mixes two or more data representations and their label information to create new data. This interpolates the embedding space in the sense of filling the space between the labeled data (Fig. 2 (d)).

Applications of Self-supervised Learning

Self-supervised learning (SSL) tries to learn the structural diversity and general semantics of unlabeled data to create an embedding space that can be used as an initial value in the process of fine-tuning. In particular, it is used to build embedding spaces to learn the semantic information of compounds in various computational forms such as SMILES or molecular graphs.

SSL with SMILES

SMILES-BERT [139] and ChemBERTa [95] utilized the BERT (or transformer) architecture, widely used in NLP, for SSL on text data because of its outstanding performance. The two BERT models used the ZINC [68] or PubChem [175] database for pre-training: by masking a portion of the tokens in each SMILES, the pre-training procedure predicts the masked symbols, thereby learning the hidden semantics of the SMILES representation. The space created by the pre-trained BERT encoder partially reflects the semantic information of the chemical compounds. Thus, the space serves as a useful intermediate state, and even with a small amount of labeled data, an informative goal state, i.e., the embedding space, can be constructed through fine-tuning.
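The masked-token pre-training objective can be illustrated with a small sketch. The character-level tokenization and mask rate here are simplifications (real tokenizers keep multi-character atoms such as "Cl" together), and the model that predicts the masked tokens is omitted:

```python
import random

def mask_smiles(tokens, mask_rate=0.15, seed=42):
    """BERT-style masking: hide a fraction of tokens; the pre-training
    objective is to recover them from the surrounding context."""
    random.seed(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            targets[i] = tok          # what the model must predict
            masked.append("[MASK]")
        else:
            masked.append(tok)
    return masked, targets

# aspirin SMILES, tokenized at character level for simplicity
tokens = list("CC(=O)Oc1ccccc1C(=O)O")
masked, targets = mask_smiles(tokens)
print(masked, targets)
```

Because the labels come from the data itself, any unlabeled SMILES corpus (e.g., ZINC or PubChem) can supply pre-training signal, and filling each `[MASK]` position forces the encoder to model the chemical context around it.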

SSL with Molecular Graph

Self-supervised learning using graphs has recently become more prevalent for chemical embeddings. We divide graph SSL into predictive and contrastive methods according to the loss function. Predictive methods generate labels related to the data and predict them using a cross-entropy loss. Contrastive methods use an InfoNCE or NT-Xent loss to control the distances between positive and negative samples in the embedding space. Due to the nature of these loss functions, the predictive methods guide the search towards a more distinguishable embedding space while, in contrast, the contrastive methods construct a more uniform embedding space. As examples of predictive methods, Hu et al. [176] developed node attribute and context prediction tasks, called AttrMasking and ContextPred, respectively. These methods train an encoder that predicts the attribute and structural information of the graph to determine efficient intermediate states. Similarly, N-Gram-Graph [177] utilized the word2vec scheme to estimate node attributes. GPT-GNN [178] focused on node attribute prediction and global topology prediction tasks. The above methods are based on "graph-driven labels" derived from the structural characteristics of the graph. There are also methods that generate "knowledge-guided labels" based on the characteristics of the molecular graph. Grover [179] generated substructure-based labels based on the type and number of atoms/bonds in the k-hop neighborhood. MGSSL [180] performed pre-training as a motif-level graph generation process, expecting to build a chemical space more suitable for molecular graphs using functional-motif-based subgraph information and a generation process over the overall structure of the graph. MolGNet [181] and Kim et al. [182] tried to avoid negative transfer by designing a chemical space that learns chemical validity to reflect chemical stability rather than specific properties.
As examples of contrastive methods, graph-topology-based approaches perform the contrastive scheme through same- or cross-view comparisons at the node/edge, subgraph, or graph level (Table 3). For example, GCC [183] performed subgraph-to-subgraph contrast, which focuses on generalization of the chemical space in terms of chemical subgraphs so that the latent representation reflects molecular properties arising from functional groups. SUGAR [184] performed subgraph-to-graph contrast to explore the interpretability and semantic connections between substructures and molecular graphs. In addition, graph-to-graph contrast methods [185], [186], [187] tried to learn semantics between augmented graphs in the given dataset. Meanwhile, MolCLR [98], MoCL [97], KCL [100], and MICRO-Graph [96] leveraged multi-level chemical knowledge in which atoms, bonds, subgraphs, or graphs can play a role in developing chemical properties. The key to the success of these methods is focusing on the semantic information shared by chemical graphs through knowledge-guided augmentation that excavates meaningful subgraphs as well as embeddings.
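The NT-Xent loss used by these contrastive methods can be sketched in NumPy. This toy version uses random vectors in place of real molecule embeddings; the point is only that two well-aligned "views" of the same molecules yield a lower loss than unrelated ones:

```python
import numpy as np

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent loss over two augmented 'views' of the same molecules.
    z1[i] and z2[i] embed two augmentations of molecule i (the positive
    pair); every other pair in the batch serves as a negative."""
    z = np.concatenate([z1, z2])
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = z @ z.T / tau
    n = len(z1)
    np.fill_diagonal(sim, -np.inf)                     # exclude self-similarity
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])  # i's positive is i+n
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return (-(sim[np.arange(2 * n), pos] - logsumexp)).mean()

rng = np.random.default_rng(1)
z1 = rng.normal(size=(4, 16))
loss_random = nt_xent(z1, rng.normal(size=(4, 16)))               # unrelated views
loss_aligned = nt_xent(z1, z1 + 0.01 * rng.normal(size=(4, 16)))  # near-identical views
print(loss_aligned < loss_random)  # aligned views give a lower loss
```

In the chemical setting, the two views would come from knowledge-guided augmentations (e.g., motif-preserving perturbations), so minimizing this loss pulls augmentations of the same molecule together while pushing different molecules apart.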
Table 3

Graph learning methods for building a chemical space.

Data level       Year  Predictive (self-supervised framework)             Contrastive (self-supervised framework)
Node/Edge-level  2019  EdgePred, AttrMasking, ContextPred, N-Gram-Graph   Infomax
                 2020  GPT-GNN                                            GRACE, InfoGraph, GMI
                 2021  MGSSL, MolGNet                                     GraphCL, JOAO, MolCLR, MoCL
                 2022  -                                                  -
Subgraph-level   2019  -                                                  -
                 2020  Grover                                             GCC
                 2021  MGSSL, MolGNet                                     GraphCL, JOAO, GraphLoG, SUGAR, MolCLR, MoCL, MICRO-Graph, MolCLE
                 2022  -                                                  -
Graph-level      2019  -                                                  -
                 2020  -                                                  InfoGraph
                 2021  -                                                  GraphLoG, MoCL
                 2022  D-SLA                                              KCL
The key advantage of SSL is that downstream deep learning models can learn from as diverse molecular structures as possible, even without provided labels. Thus, most graph-based pre-training methods quantified how much their pre-training strategy contributed to downstream tasks. The seminal work introducing 'ContextPred' by Hu et al. [176] achieved a mean improvement of 7.2% across eight benchmark data sets by pre-training on the ZINC15 database, compared to a non-pretrained vanilla GIN model. The recent chemistry-inspired method MolGNet [181] outperformed existing tools (mostly GNN methods) on both classification tasks (SIDER, ClinTox, BACE, Tox21, and ToxCast) and regression tasks (solubility, hydration free energy, and lipophilicity). MGSSL [180] also demonstrated the usefulness of its chemistry-guided pre-training strategy on a set of benchmark GNN models (GCN, GIN, RGCN, DAGNN, and GraphSAGE) by an average margin of 7.56% on eight different benchmark data sets.

Applications of Generative Learning

Generation of Molecules with Desired Properties

The purpose of generative learning is to generate new, unknown data with properties similar to the given data. The goal is to learn a latent distribution from the given data and then generate similar data from that distribution. For this, we can adopt the generator/discriminator or the encoder/decoder framework with CNNs, RNNs, and GNNs. In the field of computer vision, various structures such as GANs [188], VAEs [189], and RL [190] are used as generative models, and research on chemical generation is also in progress to address the lack of data in specific tasks such as the prediction of properties like logP. JT-VAE [168] utilized a VAE for molecule generation that directly uses molecular graphs instead of SMILES. A given molecular graph is converted into a junction tree format with a vocabulary of valid chemical substructures. Based on the junction tree, a VAE encodes the tree structure into a latent space and decodes the input tree from the latent space. While converting the reconstructed tree back into a molecular graph, JT-VAE guides the decoder using the graph embedding learned by a GNN encoder. Another work, Mol-CycleGAN [191], was a CycleGAN [192]-based model that generates new molecules with high structural similarity to the input molecules. Based on the latent space from JT-VAE, Mol-CycleGAN optimized a generator that learns desired chemical properties by discriminating two different molecule sets (e.g., active/inactive or high/low property values). Generative models explicitly sample modified molecules and then evaluate how well the generated molecule is optimized; thus, generative learning is often used in molecular optimization tasks. The generated molecules are mostly evaluated by both synthesizability and numerical properties. The recently developed Mol-CycleGAN [191] and GCPN [164] represent VAE/GAN and RL methods, respectively.
Under the penalized logP optimization task for drug-like molecules with similarity constraints (thresholds of 0/0.2/0.4/0.6), Mol-CycleGAN outperformed GCPN in the mean improvement of the property. However, in terms of success rate, Mol-CycleGAN had lower success rates under the more stringent constraints (0.4 and 0.6). Regardless of the constraints, GCPN showed robust success rates and, under the stringent constraints, improvements comparable to Mol-CycleGAN. One of the latest research trends is 3D structure-based molecule generation. Since the late 2010s, several models have been proposed to discover novel compounds with target properties, including quantum mechanics information based on the QM9 dataset. cG-SchNet [193], a conditional generative neural network for inverse design of molecules, enabled joint targeting of multiple properties, including the HOMO–LUMO gap and energy. Another 3D compound generative model, MOLGYM [194], constructed an RL environment for molecule design in Cartesian coordinates. Using rewards provided by fast quantum-chemical calculations, the agent was not only able to generate 3D molecules but also placed water molecules around a compound, predicting the solvation state and separating inter-atomic interactions from intra-atomic forces.
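The encode–sample–decode loop shared by these VAE-based generators can be sketched as below. The linear encoder/decoder and random weights are placeholders (JT-VAE, for instance, encodes junction trees with a GNN); the sketch only shows the reparameterization trick and how sampling the latent prior yields novel candidates:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, We_mu, We_logvar):
    """Encoder maps a molecule feature vector to a latent Gaussian."""
    return x @ We_mu, x @ We_logvar

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps so gradients can flow through mu/sigma."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(z, Wd):
    """Decoder maps latent codes back toward molecule space; sampling
    fresh z values from the prior generates novel, similar candidates."""
    return np.tanh(z @ Wd)

d_in, d_z = 8, 2
x = rng.normal(size=(1, d_in))                       # one 'molecule' feature vector
We_mu = rng.normal(size=(d_in, d_z))
We_logvar = rng.normal(size=(d_in, d_z)) * 0.1
Wd = rng.normal(size=(d_z, d_in))

mu, logvar = encode(x, We_mu, We_logvar)
x_rec = decode(reparameterize(mu, logvar), Wd)       # reconstruction path
novel = decode(rng.normal(size=(5, d_z)), Wd)        # 5 samples from the prior
print(x_rec.shape, novel.shape)
```

Training would add the reconstruction and KL terms of the VAE objective; models like Mol-CycleGAN and GENTRL then operate on exactly this kind of latent space to steer generated molecules toward desired properties.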

Target-specific lead identification & optimization

From a drug discovery perspective, generating a compound with desired chemical properties is important, but its interaction with a desired protein target may be even more valuable information. Several generative models have been proposed to meet the needs of lead identification and optimization with deep learning frameworks, given specific targets. GENTRL [67] was a deep generative model that identified potential DDR1 kinase inhibitors in 21 days. To guide the search, GENTRL utilized a VAE with a rich prior distribution in the latent space. The rich prior was obtained by tensor-train decomposition with chemical properties, including MCE-18 and a binary indicator of passing medicinal chemistry filters (MCFs). Then, with the trained encoder, GENTRL learned the generation process of DDR1 kinase inhibitors using an RL framework with kinase-related reward functions. MORLD [195], a docking-score-reward-based reinforcement learning framework, generated and optimized lead compounds fitting a query protein structure without intensive screening of large bioassay libraries. The authors claimed that their method sped up DDR1 kinase inhibitor discovery from the 21 days of GENTRL down to 2 days. An interesting model proposed by Méndez-Lucio et al. [196] is a two-staged GAN-based model that generates molecules fitting desired gene expression profiles. The discriminator calculates the probability that a generated molecule is a valid molecule, and a conditional neural network predicts whether the molecule fits the given desired expression profile.

Applications of Mixup

In supervised learning, when data is insufficient, decision boundaries can be constructed too tightly, which leads to overfitting the training data. Mixup [197], [198] interpolates both input data and label information to smooth decision boundaries and infer information between the boundaries. Let x be an input data point and y its label. Mixup of two data points (x_i, y_i) and (x_j, y_j) generates new data x̃ = λg(x_i) + (1 − λ)g(x_j) and ỹ = λy_i + (1 − λ)y_j, where λ ∈ [0, 1] is the mixing ratio and g is a function that maps the input x into the space in which to interpolate. For example, g(x) = x gives input mixup [199], and g(x) = h(x) for a hidden-layer representation h gives manifold mixup [200]. The deep learning model is then trained on the mixed data to predict the corresponding mixed labels, rather than the original class labels. As of now, research on mixup is mainly in the field of images. This is because the mixing technique is suitable for grid-structured data, and the labels of interpolated data may otherwise not be smooth; for example, a SMILES string produced by mixing two SMILES may represent neither of the two chemicals. Also, irregular graph sizes and connectivity are major challenges for graph-level mixup. Even so, some studies on graph-level mixup have been conducted recently. Wang et al. [197] suggested two mixup schemes for node- and graph-level classification. The node-level mixup scheme consists of a two-branch graph convolution and a two-stage mixup framework that considers the receptive field of nodes and prevents unintended mixed representations. Their graph-level mixup is performed in the embedding space of graph representations, which is equivalent to manifold mixup. Graph Transplant [198] also proposed graph-level mixup based on meaningful subgraphs related to labels; to obtain the informative subgraphs, it utilizes node saliency information and adaptively determines the labels.
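The interpolation defined above can be written directly. This sketch implements input mixup (g(x) = x) on plain feature vectors; applying the same blend to a hidden-layer representation would give manifold mixup:

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.5):
    """Input mixup: blend two feature vectors and their label vectors
    with the same ratio lam ~ Beta(alpha, alpha), yielding a new
    training point that lies between the two originals."""
    lam = random.betavariate(alpha, alpha)
    x_new = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y_new = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x_new, y_new, lam

random.seed(0)
# two one-hot labeled samples; the mix lands between the two classes
x_new, y_new, lam = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0])
print(lam, x_new, y_new)
```

The model is trained on (x_new, y_new) pairs; because the soft label moves with the features, the learned decision boundary is smoothed across the region between the two original samples.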

Additional Features Required Beyond Chemical Compound Information

In addition to features engineered by deep learning methods from molecular structures, features that cannot be directly inferred from the molecular structures can provide structurally diversified chemical information and make task-specific chemical spaces [201], [202], [203]. In this section, we discuss two major feature types, molecular descriptors and pharmacogenomics profiles. First, chemical heuristics are arranged as a general feature set. While ECFP fingerprints solely produce subgraph-based binary information, PubChem fingerprints, an 881-bit vector, include rules such as the number of aromatic rings or the number of unsaturated bonds [79]. Second, recent advances in gene expression measurement technologies, especially next-generation sequencing, can provide multi-level and extensive features in addition to chemical features [204]. For example, in terms of pharmacogenomics, RNA sequencing measures more than 20,000 genes in human samples. Molecular Descriptors. logP (partition coefficient) and TPSA (topological polar surface area) are related to the solubility of a compound in aqueous solutions, and the presence of specific structural features, such as the number of rings, is related to carcinogenesis [205]. TOP [112] fed logP, molecular weight, and TPSA into independent fully connected layers on top of the word embeddings of SMILES, along with a biGRU, to learn chemical structures. Adding physicochemical properties selected by a genetic algorithm to the biGRU featurization provided an improvement of 0.195 in AUC-ROC on toxicity prediction tasks. Tharwat et al. [206] used 31 molecular descriptors to predict four toxicity tasks in combination with multiple data sampling strategies to build an ensemble learning framework; in this way, using an entropy-based feature selection method [207], they achieved the best performance on the four toxicity tasks.
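As a hypothetical illustration of descriptor-style features, the toy function below derives a few crude counts directly from a SMILES string. These string heuristics are stand-ins only; real pipelines compute logP, TPSA, or molecular weight with a chemistry toolkit such as RDKit:

```python
def toy_descriptors(smiles):
    """Crude string-based stand-ins for molecular descriptors.
    Lowercase letters in SMILES denote aromatic atoms; ring-closure
    digits come in pairs; N/O/S counts approximate heteroatoms."""
    aromatic_atoms = sum(c.islower() for c in smiles if c.isalpha())
    rings = sum(c.isdigit() for c in smiles) // 2   # paired ring-closure digits
    heteroatoms = sum(smiles.upper().count(a) for a in ("N", "O", "S"))
    return {"aromatic_atoms": aromatic_atoms,
            "rings": rings,
            "heteroatoms": heteroatoms}

# aspirin: one aromatic ring, four oxygens
desc = toy_descriptors("CC(=O)Oc1ccccc1C(=O)O")
print(desc)
```

In a model like TOP, a vector of such descriptors would be fed through separate fully connected layers and concatenated with the learned sequence representation, supplying properties that are hard to infer from the SMILES characters alone.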
Leveraging chemical fingerprints also provided additional improvement over text-based modeling of chemical compounds in predicting solubility, hydration free energy, and lipophilicity [106]. Gene Expression Profiles. Gene expression profiles are also important information for featurizing chemical structures, as the metabolic dynamics of drugs in biological systems can be inferred from changes in gene expression profiles [51], [208], [209]. In pharmacogenomics, perturbations in gene expression can also elicit mechanistic clues about toxicity in response to drug administration [210], [211], [212], [213]. Recently, models are also being developed that use data other than chemistry to generate a desired chemical compound; in particular, gene expression profiles were used to generate candidate small-molecule drugs for cancer [196], [214]. Biological Assays. Bioassay experiments on various drugs and target organisms can be used to narrow down potential drug targets. From the computational perspective, these experiments can be cast as the drug-target interaction (DTI) problem. DTI information can be explicitly fed into ML models that predict cellular responses upon drug treatment [215], [216]. Besides, due to experimental costs and limitations, DTI prediction itself is one of the most active research areas in the machine learning community [217].

Discussions

In this section, we discuss current paradigms and future directions for deep learning based chemical embeddings.

Graph-based Chemical Embeddings

As shown in Table S1, most recent works try to utilize molecular graphs as is, as opposed to linear representations like SMILES. Graph-based chemical embedding methods have advantages in four aspects: (1) It is natural to represent chemical compounds as graphs; since atoms and bonds are represented by the nodes and edges of the graph, interaction information of compounds can be efficiently reflected. (2) Because graphs can model local- and long-range interactions, we can model rich information from structural motifs [104], [150] to global properties of the chemicals [147], [156]. (3) Depending on the goal of the task, we can learn from molecular graphs using a variety of techniques, including RNNs [106], GNNs [153], Transformers [179], and so on. (4) Domain knowledge can also be expressed in the form of graphs [218] and thus used as prior knowledge for the model to construct an effective embedding space. Because of these advantages, graph-based chemical embedding is an interesting research direction, but it may need more time and effort because intrinsic properties of compounds, such as the spatial locations of atoms, are yet to be well characterized.

Exploration on Motif-level Learning

Substructures, also known as motifs, are fundamental components that determine the characteristics or properties of a compound [219]. To reflect this in machine learning methods, as described in Section 4.1, recent studies consider molecules as a set or tree of substructures. While previous approaches used sets of k-hop neighbors (local level) in a graph to build a graph-level representation, recent approaches put more effort into leveraging chemical prior knowledge. Predictive models including Grover [179], MGSSL [180], and MolGNet [181] try to learn chemistry-level semantics of a molecular graph so that an entire graph can be constructed effectively from combinations of subgraphs. For contrastive methods, the data augmentation strategy is a crucial issue: rather than atom/bond-level addition/deletion schemes, motif-based augmentation makes it easier to create a valid chemical and retain the properties of the original compound. Meanwhile, the rich information in motifs is also useful for chemical generation. As shown by Zafirlukast's example of lead compound optimization [220], fatty acid mimetics is one of the most widely used techniques for lead optimization in medicinal chemistry, and deep generative models can support it. JT-VAE [168] is a seminal approach that effectively reconstructs the tree-like structure of molecular graphs, and successive studies have incorporated chemical priors into their policy networks [164]. Likewise, substructure-based chemical modification by RL on molecular graphs can help find a desired compound in chemical space whose stability is supported by similarity to an existing fatty acid-like antagonist [221]. As motif-level learning methods are still immature, how to exploit the power of motifs remains a challenging and attractive problem.
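The contrast between atom-level and motif-level augmentation can be sketched in stdlib Python: dropping a whole chemically meaningful fragment is more likely to leave a valid, property-preserving structure than deleting arbitrary atoms. The motif tokens below are hypothetical labels, not real SMILES fragments:

```python
# Toy sketch (stdlib only) of motif-level augmentation for contrastive
# learning: each motif is kept or dropped as a unit, so the augmented
# view stays a combination of valid substructures.
import random

def motif_dropout(motifs, p=0.25, rng=None):
    """Return an augmented view by dropping each motif with probability p,
    always keeping at least one motif."""
    rng = rng or random.Random(0)
    kept = [m for m in motifs if rng.random() > p]
    return kept or [motifs[0]]

molecule = ["benzene_ring", "amide", "methyl", "carboxyl"]
view_a = motif_dropout(molecule, rng=random.Random(1))
view_b = motif_dropout(molecule, rng=random.Random(2))
# view_a and view_b form a positive pair for a contrastive objective
```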

Pre-training for chemical space

Pre-training through self-supervised learning or generative models has become increasingly important to compensate for insufficient chemistry data for specific tasks. Building a chemical embedding space with pre-training methods has the following advantages: (1) As pre-training methods have been developed in various domains, including computer vision and natural language processing, various data formats can be used, such as graphs [97], [98] or SMILES strings [95], [139]. (2) Although pre-training uses unlabeled data, it learns a great deal about chemical compounds: training on a large amount of chemical data exposes the model to enormous structural diversity even without label information, which increases the generalizability of the trained model and yields good performance on a variety of downstream tasks. (3) Even with small labeled datasets, task-specific embedding spaces can be trained using fine-tuning or few-shot learning techniques [222]. However, the following disadvantages should also be kept in mind: (1) Little or no correlation between pre-training tasks and downstream tasks can result in negative transfer that degrades model performance [176]. (2) The lack of a theoretical foundation for pre-training techniques can make it difficult to interpret which properties of a compound have been learned by the resulting embedding space. Future studies will require more descriptive and robust pre-training methods to compensate for these points.
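Advantage (2) rests on the fact that self-supervised objectives manufacture labels from the data itself. A minimal stdlib-Python sketch of one such objective on SMILES strings, analogous to masked-language-model pre-training (the predictive model itself is omitted; only the task construction is shown):

```python
# Minimal sketch (stdlib only): mask random characters of a SMILES string
# and ask a model to recover them -- labels come for free from unlabeled
# compounds, so no annotation is needed for pre-training.
import random

def mask_smiles(smiles, mask_rate=0.15, mask_token="*", rng=None):
    """Return (masked_input, targets): positions hidden from the model and
    the characters it must predict at those positions."""
    rng = rng or random.Random(0)
    chars = list(smiles)
    targets = {}
    for i, c in enumerate(chars):
        if rng.random() < mask_rate:
            targets[i] = c       # ground-truth label, free from the data
            chars[i] = mask_token
    return "".join(chars), targets

masked, targets = mask_smiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
# `targets` maps masked position -> original character to be predicted
```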

Importance of using Negative Data

The selection of negative data is an important issue, as it plays a crucial role in determining decision boundaries in the embedding space. For example, to accurately and rapidly generate DDR1 kinase inhibitors, GENTRL [67] utilized molecules that act on non-kinase targets as negative data. For mode-of-action analysis, [223] improved in silico target prediction by utilizing negative bioactivity data held in chemogenomic repositories. For drug-target interactions, [224] proposed a systematic method to select reliable negative samples, which significantly improved prediction accuracy for protein targets of small-molecule drugs. From a technical point of view, self-supervised learning is a useful technique for building pre-trained models, but it is difficult to select useful negative data from unlabeled compounds. In contrastive learning, attracting an anchor to positive samples while repelling it from negative samples is the central mechanism, so negative data selection matters especially. We expect that new techniques will be developed with more efficient negative data selection strategies, resulting in more effective navigation of chemical space.
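The attract/repel mechanism, and why negative selection matters, can be made concrete with a small stdlib-Python sketch of an InfoNCE-style contrastive loss; the embedding vectors are hypothetical illustrations:

```python
# Toy sketch (stdlib only): an InfoNCE-style loss that pulls an anchor
# embedding toward its positive and pushes it away from sampled negatives.
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(anchor, positive, negatives, temperature=0.5):
    """Lower loss <=> anchor is more similar to the positive than to the
    negatives; the choice of negatives shapes the decision boundary."""
    pos = math.exp(dot(anchor, positive) / temperature)
    negs = sum(math.exp(dot(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + negs))

anchor   = [1.0, 0.0]
positive = [0.9, 0.1]   # e.g. an augmented view of the same compound
hard_neg = [0.8, -0.2]  # similar but distinct compound: informative
easy_neg = [-1.0, 0.0]  # trivially different compound: uninformative

loss_hard = info_nce(anchor, positive, [hard_neg])
loss_easy = info_nce(anchor, positive, [easy_neg])
assert loss_hard > loss_easy  # harder negatives give a stronger training signal
```

The final assertion is the point: a trivially dissimilar negative contributes almost nothing to the loss, which is why careful negative selection improves the learned boundary.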

Potential Risk of Overfitting

Because chemical benchmarks lack sufficient data labels while deep learning models use large numbers of parameters, such models are prone to overfitting [225], [226], [227]. Overfitting cannot be avoided entirely in chemical property prediction tasks, but it has been addressed from two perspectives: computational and data.

Computational perspective. For DL models with many trainable parameters, major achievements in reducing overfitting were made by stochastically dropping out the trained weights of randomly selected neurons [225] or by Bayesian approaches [228], [229]. To assess the out-of-sample generalizability of ML models on independent data, train/valid/test splitting strategies have been introduced based on scaffold, random, or temporal features, among others [11]. Among these, scaffold and temporal splits are reported to produce less bias than random splits [102], [230]. Meanwhile, the TDC benchmark recommends different types of data splitting, performance measures, and modeling strategies on benchmark data sets for fair comparisons [12].

Data perspective. For HTS BioAssay data, LIT-PCBA [231] developed an unbiased PCBA data set with a reduced number of protein targets (p = 15) to avoid potential overestimation by machine learning models. In a recent paper [232], FP-GNN was compared to ML models (NB, SVM, RF, and XGBoost) and DL models (DNN, GCN, and GAT) on the LIT-PCBA dataset. Given mixed fingerprints (MACCS, PubChem, and Pharmacophore ErG) as input features, the four ML models achieved 0.672 accuracy on average, while the three DL models and FP-GNN improved accuracy to 0.729 and 0.739, respectively. It appears that deep learning models have evolved to mitigate the overfitting issue.
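The scaffold-split idea can be sketched in stdlib Python: compounds are grouped by a scaffold key and whole groups are assigned to train or test, so the test set contains only unseen scaffolds. Real pipelines derive the key with cheminformatics toolkits (e.g. Bemis-Murcko scaffolds); the string keys and compound ids below are hypothetical placeholders:

```python
# Minimal sketch (stdlib only) of a scaffold split: no scaffold ever
# appears in both train and test, which gives a harder, less biased
# estimate of out-of-sample generalizability than a random split.

def scaffold_split(compounds, test_fraction=0.2):
    """compounds: list of (id, scaffold_key). A common heuristic: the
    smallest scaffold groups fill the test set; the rest go to train."""
    groups = {}
    for cid, scaf in compounds:
        groups.setdefault(scaf, []).append(cid)
    ordered = sorted(groups.values(), key=len)  # smallest groups first
    n_test = int(len(compounds) * test_fraction)
    train, test = [], []
    for group in ordered:
        if len(test) < n_test:
            test.extend(group)
        else:
            train.extend(group)
    return train, test

compounds = [("m1", "benzene"), ("m2", "benzene"), ("m3", "benzene"),
             ("m4", "indole"), ("m5", "pyridine")]
train, test = scaffold_split(compounds)
# every scaffold appears in exactly one of the two sets
```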

Conclusion

In this article, we surveyed how deep learning technologies can model and utilize chemical compound information in a task-oriented way by exploiting the annotated properties and assay data in chemical compound databases. We first compiled the kinds of tasks machine learning methods attempt to accomplish (Section 2). Then, we surveyed deep learning technologies to show their modeling power and current applications for drug-related tasks. Section 3 surveyed deep learning techniques to address the insufficiency of annotated data for more effective navigation of chemical space. In Section 5, we surveyed what kinds of information, such as assay and gene expression data, can improve the prediction power of deep learning models. The final section surveyed four important newly developed techniques that are yet to be fully incorporated into computational analysis of chemical information. Recently developed deep learning methods have demonstrated their ability to increase the efficiency of lead compound optimization, and various self-supervised graph learning methods are being developed on databases such as ZINC to address the insufficiency of labeled data. By incorporating these newly developed techniques, deep learning models can become more powerful tools for exploring chemical space in search of new compounds or new properties of existing compounds, accelerating the drug discovery process.

CRediT authorship contribution statement

Sangsoo Lim: Conceptualization, Investigation, Writing - original draft. Sangseon Lee: Conceptualization, Investigation, Writing - original draft. Yinhua Piao: Investigation, Writing - original draft. MinGyu Choi: Investigation, Writing - original draft. Dongmin Bang: Investigation. Jeonghyeon Gu: Investigation. Sun Kim: Conceptualization, Writing - review & editing, Supervision, Project administration, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
