
Artificial intelligence: machine learning for chemical sciences.

Akshaya Karthikeyan¹, U Deva Priyakumar¹.

Abstract

Research in molecular sciences witnessed the rise and fall of Artificial Intelligence (AI)/Machine Learning (ML) methods, especially artificial neural networks, a few decades ago. However, the last few years have seen a major resurgence in the use of modern ML methods in scientific research. These methods have had phenomenal success in computer vision, speech recognition, natural language processing (NLP), and related areas, which has inspired chemists and biologists to apply them to problems in the natural sciences. The availability of high-performance Graphics Processing Unit (GPU) accelerators, large datasets, new algorithms, and libraries has enabled this surge. ML algorithms have been applied successfully to various domains in molecular sciences, providing much faster and sometimes more accurate solutions than traditional methods such as Quantum Mechanical (QM) calculations, Density Functional Theory (DFT), or Molecular Mechanics (MM) based methods. Areas where ML methods have been shown to be effective include drug design, prediction of high-level quantum mechanical energies, molecular design, molecular dynamics, materials, and retrosynthesis of organic compounds. This article conceptually introduces various modern ML methods and their relevance and applications in computational natural sciences.

Synopsis: The recent surge in the application of machine learning (ML) methods in fundamental sciences has led to the view that these methods may become important tools in chemical science. This perspective provides an overview of modern ML methods and their successful applications in chemistry during the last few years.

© Indian Academy of Sciences 2021.


Keywords: Deep learning; computational chemistry; computational materials; drug design; machine learning; molecular design; neural networks

Year: 2021 PMID: 34955617 PMCID: PMC8691161 DOI: 10.1007/s12039-021-01995-2

Source DB: PubMed Journal: J Chem Sci (Bangalore) ISSN: 0253-4134

Introduction

The application of ML methods to problems in the natural sciences started a few decades ago. The first publication in this area, by Hiller et al. in 1973, used a three-layer perceptron network to classify substituted 1,3-dioxanes as pharmacologically active or inactive.[1] From the 1990s, artificial neural networks (ANNs) were prevalent in computer-aided drug design, especially in quantitative structure-activity relationship (QSAR) studies.[2] However, the application of ML methods to other areas of scientific research remained a niche domain that received little attention until recently.[3] Experiment, theory and computation are recognized as the three cornerstones on which scientific advances are made. The advent of new deep learning (DL) algorithms, along with new datasets, libraries, and better computing infrastructure, has fueled data-driven methods as the fourth paradigm. Figure 1 shows the number of publications in American Chemical Society (ACS) journals with “machine learning” in the abstract through the years. It shows that ML has grown at a remarkable rate over the past four years into one of the most popular research directions.

Figure 1

The rise of machine learning over the years, as evident from the number of publications in American Chemical Society journals with “machine learning” anywhere in the article.

An extreme view on AI/ML is that it “has made huge progress in perception”. The immense hype around it has attracted the attention of people from all walks of science and technology. A recent example of the impact of modern ML methods concerns one of the holy grails of biological research: modeling the structure of a protein from its primary sequence. Critical Assessment of protein Structure Prediction (CASP) is a competition held every two years since 1994, in which research teams from around the world attempt to predict the three-dimensional structures of proteins from just their amino acid sequences. The targets are proteins whose structures are close to being solved, or have recently been solved but are withheld from the public. The 14th and most recent edition took place in November 2020.[4] By comparing the computational predictions with the experimental results, each CASP14 competitor received a global distance test (GDT) score, a structure similarity measure for comparing protein folds. One of the competitors, a company called DeepMind, outperformed the others by a huge margin. DeepMind’s AlphaFold 2 produced models for about two-thirds of the CASP14 target proteins with GDT scores above 90, a level at which the models are considered roughly equivalent to experimentally determined structures. AlphaFold 2 is so accurate that many have hailed it as the solution to the long-standing protein structure prediction problem.[4-6] The large gap between the performance of DeepMind and that of the other teams was primarily due to the engineering aspects of the ML algorithms used.[7] This is just one of the many successes of modern ML methods, and one example of how these algorithms, along with physics-based methods, may change the nature of scientific computing in the years to come.

The rest of the article is structured as follows. First, a short overview of different types of molecular representations and datasets is presented. Then, selected ML methods are discussed at the conceptual level. This is followed by brief discussions of a few popular areas of molecular sciences where ML has found success. Finally, the challenges faced by ML in molecular sciences are analyzed, along with a discussion of how this area may evolve.

The role of ML in AI

The definitions of AI, ML, and DL have changed over the years, and so have the relationships among them. Conventionally, AI is a general area that can loosely be described as a class of techniques that enable computers to mimic human intelligence. Recently, AI systems have performed as well as, or even better than, humans in several tasks.[8] AI, and its most common subfield ML, study methods that enable machines to perform intelligent tasks skillfully without being explicitly programmed for them. Today, in its various forms, AI is applied successfully across domains ranging from robotics and image analysis to molecular sciences. Most researchers agree that one of the primary requirements for intelligent behavior is learning, which makes ML one of the most rapidly developing subfields of AI. It is now even argued that ML has outgrown its parent. DL and Reinforcement Learning (RL) are subcategories of ML that have developed more recently. Figure 2 shows a schematic of the conventional relationship between these categories.

Figure 2

Schematic of the conventional relationship between artificial intelligence (AI), machine learning, deep learning and reinforcement learning.

Machine learning

Within AI, ML has emerged as the method of choice for developing practical software for machine translation, speech recognition, computer vision, recommendation systems and other applications.[9,10] ML, which includes DL, relies on statistical methods to learn from data. Using these techniques, we can extract complex and often hidden patterns from given data sets and can express them as mathematical objects. Many of the AI system developers now agree that, for many tasks, it can be far simpler to train a system by showing it examples of desired input-output behavior than to program it manually.

Deep learning

Traditional ML is limited by the size of its input data. For example, analyzing an image of conventional size means sending thousands of pixels to the system, so the information must first be filtered and grouped to select what is essential to the task. DL is capable of handling such problems. It uses multi-layered neural networks, extremely large amounts of data, and substantial computing time to make accurate predictions. Unlike traditional ML, DL does not require features (discussed later) to be hand-engineered from the raw data. Function specification (defining what to learn from the given data) and optimization (how to weigh the data appropriately) are handled by the algorithm itself, which has made DL extremely popular in fields such as speech recognition,[11] computer vision,[12] NLP,[13] and, recently, molecular sciences.

Chemical representations and descriptors

Chemical representations

Traditionally, molecules are depicted as structure diagrams with bonds and atoms. However, other representations are required for the computational processing of chemical structures. The chemical representation of a molecule may contain its spatial or topological information in a computer-interpretable format.[14,25] Current representations can be broadly classified into three types: discrete (e.g., text), continuous (e.g., vectors and tensors) and weighted graphs. Atomic coordinates, graph representations, the simplified molecular-input line-entry system (SMILES) and the International Chemical Identifier (InChI) are some of the popular representations. A molecular graph representation maps the atoms and bonds in a molecule to sets of nodes and edges, respectively. Formally, it is a 2D matrix that can also be used to represent 3D information such as atomic coordinates and bond angles. A simple example is the adjacency matrix A, where $a_{ij} = 1$ means that there is a bond between nodes $v_i$ and $v_j$ in the molecular graph, and $a_{ij} = 0$ means that there is none. However, such matrices are not compact, because their size scales as the square of the number of atoms. Linear notations like SMILES and InChI do not have this problem. SMILES translates a chemical’s 3D structure into a string of symbols based on a set of rules. Like a connection table (Ctab), it identifies the nodes and edges of a molecular graph. InChI, another line notation, is a hierarchical layered notation in which each new layer describes more complex chemical characteristics. The first few layers hold the information contained in the connection table, and additional layers (if needed) deal with complexities like isomers and isotopic distributions. InChI provides a unique identifier, while SMILES is commonly used for storage and interchange of chemical structures.
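As a minimal sketch of the adjacency-matrix representation described above, the following Python snippet builds A from a list of bonds. The function name `adjacency_matrix`, the atom ordering, and the hand-written bond list for the heavy atoms of ethanol are assumptions made for illustration; in practice a cheminformatics toolkit would derive the graph from a SMILES or InChI string.

```python
# Build the adjacency matrix of a molecular graph from a bond list.
# Atom ordering and the bond list are hand-written assumptions for this example.

def adjacency_matrix(n_atoms, bonds):
    """Return an n_atoms x n_atoms matrix with a_ij = 1 if atoms i and j are bonded."""
    a = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        a[i][j] = 1
        a[j][i] = 1  # bonds are undirected, so the matrix is symmetric
    return a

# Heavy atoms of ethanol (SMILES: CCO), indexed C0, C1, O2
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]

A = adjacency_matrix(len(atoms), bonds)
for row in A:
    print(row)
```

The matrix holds n² entries for n atoms, whereas the SMILES string for the same heavy-atom skeleton needs only three characters, which illustrates why line notations are preferred for compact storage.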

Molecular descriptors and fingerprints

Using algorithms, the physical and chemical information encoded within the symbolic representations of molecules is transformed into useful mathematical representations, known as molecular descriptors or feature vectors.[15,16] Efforts have been made to define criteria for efficient descriptors: they need to be interpretable, invariant to the symmetries of the underlying physics, and direct and concise to avoid redundancy and the curse of dimensionality. Molecular descriptors can also be experimental values like density, logP, dipole moment and so on. They are used for tasks such as finding quantitative structure-property relationships (QSPRs) and QSARs, virtual screening (VS), and similarity searching, because molecules with similar properties tend to have similar descriptors.[15,17,18] Molecular descriptors can have a significant impact on the performance of ML models, depending on how well they capture the features relevant to the specific task. In 2013, Hansen et al.[19] improved their method of predicting atomization energies of organic molecules largely by modifying the representation used. By using variations of the Coulomb matrix (the representation used for the previous state-of-the-art model), they were the first to highlight the importance of good data representation in QM tasks. Molecular descriptors are commonly categorized as 0D (0-dimensional), 1D, 2D, 3D and 4D descriptors (Figure 3).[17] 0D descriptors, such as atom and bond counts, contain no information about the molecular structure. 1D descriptors contain information obtained from the molecular formula, like molecular fingerprints. Molecular fingerprints encode the structural features of molecules in a binary bit string format. Circular fingerprints, based on the Morgan algorithm,[20] encode which substructures are present in a molecule.[21,22] Extended-connectivity fingerprints (ECFPs),[23] among the most common circular molecular fingerprints, are often used in QSAR models for lead optimization. A more recent fingerprint, the MinHashed atom-pair fingerprint up to a diameter of four bonds (MAP4), is suitable for both small and large molecules and has been proposed as a universal fingerprint.[24]
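To make the idea of a binary bit-string fingerprint and its use in similarity searching concrete, the sketch below hashes a set of substructure labels into a fixed-length bit vector and compares two molecules. It is only a toy: it is not ECFP or MAP4, the fragment labels are written by hand rather than enumerated from a structure, and the choice of the Tanimoto coefficient as the similarity measure is an assumption that is not taken from the text.

```python
import zlib

def toy_fingerprint(fragments, n_bits=64):
    """Hash each substructure label to one position of a binary bit vector."""
    bits = [0] * n_bits
    for frag in fragments:
        # crc32 is used instead of hash() so that the result is reproducible
        bits[zlib.crc32(frag.encode("utf-8")) % n_bits] = 1
    return bits

def tanimoto(fp_a, fp_b):
    """Shared on-bits divided by the on-bits present in either fingerprint."""
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    return both / either if either else 0.0

# Hand-written bond-type labels (assumption, not generated from a structure)
ethanol = ["C-C", "C-O", "O-H", "C-H"]
methanol = ["C-O", "O-H", "C-H"]

fp_ethanol = toy_fingerprint(ethanol)
fp_methanol = toy_fingerprint(methanol)
similarity = tanimoto(fp_ethanol, fp_methanol)
print(similarity)
```

Because distinct labels can hash to the same position, a longer bit vector reduces collisions at the cost of a larger descriptor.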

Figure 3

The common classification method of molecular descriptors.

2D descriptors contain information concerning the size, configuration, and/or electronic distribution of molecules. These include variants of the molecular graph representation[25] and the Coulomb matrix (CM). The CM is a square (atom-by-atom) matrix that encodes the nuclear charges (Z) and Cartesian coordinates (R) of the atoms:

$$C_{ii} = \tfrac{1}{2} Z_i^{2.4} \qquad (1)$$

$$C_{ij} = \frac{Z_i Z_j}{\lvert \mathbf{R}_i - \mathbf{R}_j \rvert}, \quad i \neq j \qquad (2)$$

where $Z_i$ is the nuclear charge and $\mathbf{R}_i$ is the nuclear position of atom $i$. Equation (1) corresponds to the approximate electronic potential energy of a free atom and Eq. (2) corresponds to the Coulomb nuclear repulsion terms. 3D descriptors usually depend on the 3D conformation of the molecule, like van der Waals volume and WHIM descriptors.[26] 4D descriptors are usually obtained through reference grids and molecular dynamics (MD) simulations.

Other examples of molecular featurization include Bag of Bonds (BoB)[27] and the BAND[28] descriptor. BoB[27] can be seen as a histogram vector in which each unit, called a “bag”, counts the number of times a particular bond (such as C-O, C-H, etc.) appears. As in the CM, a bag contains the internuclear Coulomb repulsion between the atoms involved. In 2019, Laghuvarapu, Pathak, and Priyakumar[28] proposed the BAND neural network for predicting atomization energies, based on a chemically intuitive representation that captures the essence of molecular mechanics (MM) force fields. The BAND descriptor is computed as the sum of energy contributions from bonds (B), angles (A), nonbonds (N), and dihedrals (D).
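The Coulomb matrix can be sketched in a few lines of Python. The snippet assumes the commonly used definition, with diagonal entries 0.5·Z^2.4 and off-diagonal entries Z_i·Z_j divided by the interatomic distance, and it assumes coordinates in atomic units (bohr). The function name `coulomb_matrix` and the H2 geometry with a bond length of 1.4 bohr are illustrative choices.

```python
import math

def coulomb_matrix(charges, coords):
    """Return the Coulomb matrix for nuclear charges Z_i and positions R_i."""
    n = len(charges)
    cm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # diagonal: approximate electronic potential energy of a free atom
                cm[i][j] = 0.5 * charges[i] ** 2.4
            else:
                # off-diagonal: Coulomb repulsion between nuclei i and j
                cm[i][j] = charges[i] * charges[j] / math.dist(coords[i], coords[j])
    return cm

# H2 with an assumed bond length of 1.4 bohr
charges = [1, 1]
coords = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.4)]

cm = coulomb_matrix(charges, coords)
for row in cm:
    print(row)
```

The entries depend only on interatomic distances, so the matrix is unchanged by translating or rotating the molecule, but it does change when the atoms are reordered; this is one reason sorted or bagged variants such as BoB are used.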

Molecular datasets

The performance of ML models depends heavily on the availability and quality of data. One of the challenges of using ML is getting the right data in the appropriate format. Getting the right data involves gathering information that contains signals correlated with the outcomes of the task. For example, information on the NMR spectra of molecules won’t help in solvation energy prediction. High-quality datasets are usually difficult and expensive to create, and supervised learning (discussed later) also requires a significant amount of time to label the data. The first ML algorithms for molecular modeling in 2010–2012 relied on small datasets containing quantum mechanical (QM) properties for 10²–10³ molecular systems.[29-31] The chemical compound space (CCS) is estimated to comprise on the order of 10⁶⁰–10¹⁰⁰ molecular systems.[32,33] In the last decade, increasingly larger chemical spaces were built and explored. Large-scale QM and MD methods, along with advances in high-throughput experiments, are generating data at an incredible rate. Today, DL models are capable of predicting chemical properties with reasonable accuracy after being trained on less than 5% of a large molecular dataset. Such data efficiency and quality are crucial for in-silico chemical discovery. Most studies applying ML to predict QM properties, like atomization energy, use either the QM7(b) dataset or its larger version QM9.[34,35] Both are subsets of the combinatorially generated molecular library GDB, which includes over 10⁹ stable organic compounds with up to 17 heavy atoms,[36] essentially covering all small drug-like molecules. Other datasets are used in ML problems such as predicting drug-target affinity (like Kiba[37] and Davis[38]), solvation energy (like FreeSolv[39] and MNSol[40]), spectrum prediction (like NMRShiftDB[41]), molecule generation (like MOSES[42]) and many other tasks in molecular sciences. Datasets such as ZINC and ChEMBL include over 10⁸ drug-like molecules for studying problems like ligand discovery. PubChem, a database of over 10⁸ chemical substances and their activities,[43,44] is used in fields including VS, drug repurposing, drug side effect prediction, chemical toxicity prediction and metabolite identification.

ML approaches in molecular sciences

ML algorithms have successfully been applied to various domains in molecular sciences to obtain faster and more accurate solutions when compared to traditional methods (like QM calculations, DFT or MM-based methods, etc.). The relationship between a molecular structure and its properties is largely deterministic.[45] ML models take advantage of this through their flexibility (e.g. universal approximation theorem for ANNs) and learn the underlying QSPRs of a problem, even from simple chemical representations.[46] ML approaches can be classified based on various standards. One method of classification is based on whether the ML system needs human supervision. Based on this, ML approaches are broadly categorized into three types: supervised, unsupervised, and reinforcement learning (Figure 4). This section presents a brief account of selected popular ML methods that have been used to tackle molecular science problems.

Figure 4

Examples of various machine learning approaches and algorithms.

Supervised learning

The most widely used ML methods are supervised.[47] Molecular property predictions usually fall into this category. Supervised learning is the process of learning a function that maps an input to an output based on input-output pairs labelled by humans. The algorithms aim to minimize the errors between their predictions and the labels during the learning process. Supervised models can extract complex nonlinear patterns and can outperform manually programmed traditional models. The most basic algorithm is linear regression, which is expressed as

$$h_\theta(\mathbf{x}) = \theta^{\top} \mathbf{x}$$

where $\mathbf{x}$ is the feature vector, $h_\theta$ is the hypothesis function (the mathematical formula used to model a problem), and $\theta$ is the model’s parameter vector, which includes a bias term. The following sections briefly present examples of supervised algorithms applied to various molecular science tasks.

5.1a Traditional ML methods: Traditional ML methods can loosely be said to encompass the fundamental algorithms that are often the foundation for more cutting-edge ML. They are of several types: kernel-based methods (like SVMs), decision tree methods (like Random Forests and XGBoost), Bayesian methods, etc. These algorithms can be used to solve classification and regression problems. For example, molecular property prediction is a regression problem for which algorithms such as Kernel Ridge Regression (KRR),[27,48,49] Random Forests,[50,51] and Elastic Net[52] have been employed. Although they have been applied successfully in various fields, traditional models rely on molecular descriptors hand-engineered from the symbolic representation of molecules, which requires domain expertise. Some ML approaches use experimental measurements, such as physico-chemical properties, as descriptors, but the cost of obtaining such optimized descriptors is the bottleneck. Deep neural networks (DNNs) are capable of automatic feature extraction and greatly outperform traditional methods when it comes to large datasets of complex problems. However, traditional ML methods are still preferred over DNNs when the dataset is small, as DNNs tend to overfit. The performance of these methods with respect to dataset size is shown in Figure 5. Traditional models are also often conceptually simpler. Most DNNs work like a “black box”, which is a major limitation in fundamental science, where uncertainty measures and interpretability are desired.
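The linear hypothesis introduced at the start of this section can be fitted in closed form when there is a single feature. The sketch below is a minimal illustration under stated assumptions: the data are synthetic and exactly linear (y = 2x + 1), the names `fit_line` and `predict` are hypothetical, and ordinary least squares is used as the fitting criterion, which the text does not specify.

```python
def fit_line(xs, ys):
    """Least-squares fit of h(x) = theta0 + theta1 * x; returns (theta0, theta1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    theta1 = cov / var                  # slope
    theta0 = mean_y - theta1 * mean_x   # bias term
    return theta0, theta1

def predict(theta, x):
    """Evaluate the hypothesis function for one input value."""
    theta0, theta1 = theta
    return theta0 + theta1 * x

# Synthetic, noise-free data (assumption): one descriptor value per molecule
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

theta = fit_line(xs, ys)
print(theta, predict(theta, 5.0))
```

With several descriptors per molecule the same idea generalizes to a parameter vector, and regularized variants such as Elastic Net or KRR add a penalty term to limit overfitting.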

Figure 5

The performance of traditional ML methods and neural networks with respect to dataset size.

5.1b Artificial Neural Networks (ANNs): ANNs (in their basic feedforward form also known as multilayer perceptrons), which are loosely modelled on biological neural networks,[53,54] are among the most widely applied models in computational studies. An ANN can be thought of as transforming the input x into a new feature space in which it becomes correlated with the output y. When ANNs transform features sequentially through several layers, they are referred to as DNNs. They are excellent tools for identifying patterns and correlations that are far too complex or numerous for a human to extract and program manually. Each layer consists of one or more artificial neurons (Figure 6). These neurons calculate the weighted sum of the outputs of their preceding neurons and add a bias. Before the result is passed to the succeeding neurons, an activation function decides whether the value should be “activated” or not. Since the value can range from $-\infty$ to $+\infty$, the type of activation function is chosen depending on the task. For example, the Rectified Linear Unit (ReLU) is an activation function that outputs x if x is positive and 0 otherwise, and it can be employed in large neural networks for sparsity.

Figure 6

The structure of an Artificial Neuron.
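The computation performed by a single artificial neuron, a weighted sum plus a bias followed by an activation function, can be sketched as follows. The input values, weights and bias are arbitrary numbers chosen for illustration, and the function names are hypothetical.

```python
def relu(x):
    """Rectified Linear Unit: x if x is positive, 0 otherwise."""
    return x if x > 0 else 0.0

def neuron(inputs, weights, bias, activation=relu):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

inputs = [1.0, 2.0]

# Pre-activation 0.5*1.0 - 0.25*2.0 + 0.1 is positive, so the neuron is active
out_active = neuron(inputs, [0.5, -0.25], 0.1)

# Pre-activation 0.5*1.0 - 1.0*2.0 + 0.1 is negative, so ReLU outputs zero
out_silent = neuron(inputs, [0.5, -1.0], 0.1)

print(out_active, out_silent)
```

A layer is simply many such neurons applied to the same inputs, and a DNN stacks several layers so that the outputs of one become the inputs of the next.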

When a neuron contributes to predicting the correct results, the connections associated with it are strengthened, i.e., the updated weight values are higher. During the feed-forward pass, the output of each neuron up to the last layer is calculated. The predicted and target outputs are then compared to find each neuron’s contribution to the error. A numerical optimization technique called gradient descent is used to update the weight values by backpropagating the errors towards the input layer. The learning algorithm is typically represented as:

$$w_{ij}(n+1) = w_{ij}(n) + \eta \, (y_j - \hat{y}_j) \, x_i$$

where $x_i$ is the input, $y_j$ is the target value of the output, $\hat{y}_j$ is the predicted value, $w_{ij}$ is the weight between input and output, n is the step, and $\eta$ is the learning rate. The learning rate is chosen such that model training converges in a reasonable time. DNNs learn high-level features from data incrementally, with each additional hidden layer capturing higher-level features than the previous one. This eliminates the need for domain expertise and manual feature extraction. Thus, DNNs can automatically learn to extract the molecular descriptors best suited to the given data. However, since features have to be learned from scratch for every new dataset, these methods can overfit when data are limited. The most basic type of ANN is a feedforward neural network, in which information travels in only one direction, from input to output. There are a variety of others, like recurrent neural networks (RNNs), CNNs, etc.

5.1c Recurrent neural networks (RNNs): A vanilla ANN processes each input independently and does not remember what it processed previously. This is a disadvantage when it comes to identifying patterns and correlations in sequential data, for example, the amino acid sequence of a protein. RNNs are ANN architectures that, owing to their recurrent memory cells, are capable of remembering data and modelling short-term dependencies; they are popularly used in sequence modeling and generation. The RNN cell retains knowledge of what the model saw in the previous time-step when processing the current time-step’s information, which may affect the interpretation of the current one. Figure 7 shows a basic pipeline of an RNN sequentially generating molecules via SMILES. The output of each RNN cell is fed as input to the next RNN cell. The cells share their weights and pass along a hidden state that captures the past information in the sequence. Concatenating all the outputs creates the completed SMILES string for a newly generated molecule.

Figure 7

Recurrent Neural Network for sequentially generating molecules via SMILES.
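The way an RNN cell carries information from one time-step to the next can be sketched with a single recurrent update. The snippet assumes a simple Elman-style cell, h_t = tanh(W_xh x_t + W_hh h_{t-1} + b), which is one common choice and is not specified in the text; the four-token vocabulary and the hand-picked, untrained weights are purely illustrative.

```python
import math

VOCAB = ["C", "O", "(", ")"]  # toy SMILES vocabulary (assumption)

def one_hot(token):
    """Encode a token as a vector with a single 1 at its vocabulary index."""
    return [1.0 if t == token else 0.0 for t in VOCAB]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def rnn_step(x, h_prev, w_xh, w_hh, bias):
    """One recurrent update: combine the current input with the previous state."""
    from_input = matvec(w_xh, x)
    from_memory = matvec(w_hh, h_prev)
    return [math.tanh(a + b + c) for a, b, c in zip(from_input, from_memory, bias)]

def encode(tokens, w_xh, w_hh, bias):
    """Run the cell over a token sequence, reusing the same weights at each step."""
    h = [0.0] * len(bias)
    for token in tokens:
        h = rnn_step(one_hot(token), h, w_xh, w_hh, bias)
    return h

# Hand-picked, untrained weights: 2 hidden units, 4 input tokens
W_XH = [[1.0, 0.0, 0.5, -0.5],
        [0.0, 1.0, -0.5, 0.5]]
W_HH = [[0.5, 0.0],
        [0.0, 0.5]]
BIAS = [0.0, 0.0]

h_co = encode(["C", "O"], W_XH, W_HH, BIAS)
h_oo = encode(["O", "O"], W_XH, W_HH, BIAS)
print(h_co, h_oo)
```

Both sequences end with the same token, yet the final hidden states differ because the state after the first token is different; this is the memory that a feedforward network lacks.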

When basic RNNs are trained to capture long-term dependencies, the gradient shrinks or explodes as it backpropagates through time; these are the vanishing and exploding gradient problems.[55,56] They prevent RNNs from learning such features from long sequences. The long short-term memory (LSTM) unit, and its variant the gated recurrent unit (GRU), are types of RNN unit that contain “gates” which lessen these gradient problems. The gates decide how much to remember from the past, what to include in the current state, and what to pass on as output to the next cell. The gradients can thus be preserved over longer sequences. LSTMs and GRUs are popular for inverse molecular design because molecular representations such as SMILES have long-term dependencies, like closing parentheses and ring closures. When generating molecules as SMILES, the output layer usually gives a probability for every possible SMILES token rather than a single character, because of these strict long-term dependencies. Typically, in generative mode this distribution is sampled, while in training mode the token with the highest probability is chosen.
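The two ways of turning the output distribution into a token, sampling and picking the most probable token, can be contrasted in a short sketch. The probabilities below are hypothetical values standing in for the output layer of a trained model, and the function names are illustrative.

```python
import random

def greedy_token(probs):
    """Return the token with the highest probability."""
    return max(probs, key=probs.get)

def sample_token(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical output-layer probabilities for the next SMILES token
probs = {"C": 0.6, "O": 0.25, "(": 0.1, ")": 0.05}

rng = random.Random(0)  # fixed seed so that the run is reproducible
samples = [sample_token(probs, rng) for _ in range(1000)]

print(greedy_token(probs))
print({t: samples.count(t) for t in probs})
```

Sampling lets less probable tokens appear occasionally, which is what gives a generative model its diversity, whereas always taking the most probable token would produce the same string every time.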

Unsupervised

Unlike supervised learning, unsupervised learning is the process of learning without labelled data. Instead of picking out specific types of data that are predefined as desired, it simply looks for data that can be grouped based on their similarities, which is why it is often associated with clustering or grouping. The system is trained on large amounts of data and learns by itself. The following section presents a few examples of unsupervised learning for different tasks.

5.2a Autoencoders (AEs): Studies have aimed to derive molecular descriptors in an unsupervised and data-driven way. In 2016, Gómez-Bombarelli et al.[57] created the first ML-based generative model for molecules, called CharacterVAE. The model also delivered a data-driven method for obtaining molecular descriptors. They developed a variational autoencoder (VAE) to convert the discrete SMILES representation of a molecule to and from a continuous multidimensional representation. An AE is an ANN architecture for unsupervised feature extraction. It consists of an encoder, a decoder, and a distance function. The encoder compresses the input into a lower-dimensional fixed vector (the latent representation), and the decoder then reconstructs the input from this vector. A distance function measures the difference between the original input and the reconstructed output. The objective of training is to minimize the information loss of the reconstruction. If the input is the chemical representation of a molecule, the bottleneck vector between the two networks forces the essential information about the molecule to be compressed, so that the decoder makes as few errors as possible in the reconstruction. If the compressed vector captures all the information needed to reconstruct the original chemical representation accurately, it may also capture more general chemical information about the molecule. This idea could be used to obtain molecular descriptors for property prediction ML models.

Vanilla AEs are, however, not employed for de novo drug design, as they are not capable of learning a generalized representation of the molecules. The valid molecules lie on a continuous manifold of functionality, but because of the large number of NN parameters and the relatively small amount of training data, the AE may learn some explicit (non-continuous) mapping of the training set. The learned latent space may therefore contain large “dead areas”, from which the decoder cannot decode valid SMILES. VAEs generalise AEs and are capable of forming continuous latent spaces. The model is constrained to learn the parameters of a latent distribution for its input, usually the mean and variance (Figure 8). This restriction encourages all areas of the latent space to correspond to the decoding of valid molecules. When VAEs are trained to reproduce molecules and properties together, the latent space reorganizes so that molecules with similar properties lie near each other.[58,59]

Figure 8

(a) An AE encodes the molecules into a feature space and decodes them back (b) A VAE encodes the molecules into the latent space, which is a continuous numerical representation.
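The role of the mean and variance in a VAE can be sketched numerically. The snippet below assumes the standard formulation in which a latent point is drawn as z = mean + sqrt(variance)·ε with ε from a standard normal distribution, and in which the latent distribution is pulled towards a standard normal by a Kullback-Leibler penalty. Neither detail is spelled out in the text, and the encoder outputs used here are hypothetical numbers rather than the result of a trained network.

```python
import math
import random

def sample_latent(mean, variance, rng):
    """Draw z = mean + sqrt(variance) * eps, with eps ~ N(0, 1) per dimension."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mean, variance)]

def kl_to_standard_normal(mean, variance):
    """KL divergence from N(mean, variance) to N(0, 1), summed over dimensions."""
    return 0.5 * sum(m * m + v - math.log(v) - 1.0 for m, v in zip(mean, variance))

rng = random.Random(0)

# Hypothetical encoder outputs for one molecule (2 latent dimensions)
mean = [0.3, -1.2]
variance = [0.5, 0.1]

z = sample_latent(mean, variance, rng)             # point passed to the decoder
penalty = kl_to_standard_normal(mean, variance)    # regularization term
print(z, penalty)
```

Because every molecule is encoded as a small cloud of points rather than a single point, neighbouring regions of the latent space are visited during training, which discourages the “dead areas” that a vanilla AE can develop.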

5.2b Generative adversarial networks (GANs): GANs[60] are a rapidly evolving research area. They are a clever way of training a generative model that consists of two sub-models: the generator model G and the discriminator model D. Both are ANNs, typically trained together with stochastic gradient descent (SGD). The key idea is that the discriminator’s job is to tell whether the sample it is looking at was generated by the generator or came from the training dataset. In de novo molecular design, the generated sample is a molecule, and the training data is a library of valid molecules (Figure 9). G learns the training data distribution in order to fool D. The distribution is compressed into a latent space, from which the generator draws inputs for creating new molecules.

Figure 9

GAN architecture for molecular design.

The generator G and the discriminator D have different objectives, and they can be seen as two players in a minimax game:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

where $p_{\mathrm{data}}$ is the data distribution and $p_z$ is the distribution of the latent inputs z from which the generator draws. GANs are implicit generative models, i.e., the model parameters are learned without specifying a likelihood. The two models are trained until D is fooled about half the time, meaning that G is generating valid molecules from the distribution of the training data. Figure 9 shows the general GAN architecture used for molecular design.

5.2c Reinforcement learning (RL): RL is an autonomous, self-teaching approach that learns dynamically through trial and error. Like a pet trained using treats and punishments, these algorithms are rewarded when they make the right decisions and penalized when they make the wrong ones. The agent performs actions with the aim of maximizing rewards. RL has been used in domains like robotics, self-driving cars, and board games. In RL, the information given to the system is intermediate between supervised and unsupervised learning.[61] The samples for RL don’t contain the desired input-output pairs. Instead, they indicate whether an action is correct or incorrect. Given a state s ∈ S, an RL agent has to choose which action a ∈ A to take, where S and A are the sets of possible states and actions, respectively. For this, the agent learns a policy for an unknown dynamic environment, which defines its behavior. Essentially, the policy maps the perceived states to the actions taken in them, with the objective of maximizing the expected reward over time. The reward indicates how good it was to take an action in a certain state. RL problems are generally framed as Markov decision processes (MDPs). This means that the environment is fully observable and that the current state contains all the information necessary to choose an action; awareness of past states adds no further knowledge. However, this is only an approximation for many real problems. In a partially observable Markov decision process (a generalization of the MDP), the agent interacts with an incomplete representation of the environment. This has been useful in cases like SMILES generation, since drug-likeness only makes sense for a completed SMILES string.

There is renewed interest in RL,[62] especially in combination with DNNs, which is known as deep RL. Deep RL has produced remarkable systems such as DeepMind’s AlphaGo, an algorithm that beat world champions at the board game Go. The game has a theoretical complexity of more than 10¹⁴⁰ possible solutions.[63] An analogy can be drawn with the complexity of CCS exploration, which shows the potential of the approach. RL has been applied successfully in de novo drug design. One popular RL approach involves the agent building new molecules in a step-wise fashion.[64,65] Simm et al.[64] designed molecules by sequentially drawing atoms from a given bag and placing them onto a 3D canvas. Intuitively, the agent is rewarded for placing atoms so that the energy of the resulting molecule is low. Figure 10 shows a general pipeline of a deep RL approach for generating molecules with desired properties via SMILES. Here, the agent generates molecules and is rewarded if the molecular properties predicted by the QSAR model are desirable. Deep RL can also be employed to optimize molecules towards desired properties.[66,67]

Figure 10

A Reinforcement Learning method where the desired molecular properties are used as a reward for generating desired structures.
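The pipeline in Figure 10 can be illustrated with a deliberately small sketch. It is not the method of any cited study: the three-letter alphabet, the fixed string length, and `toy_property_score` (a hypothetical stand-in for a QSAR predictor) are all assumptions made for illustration, and the policy is a plain table of logits updated with the REINFORCE policy-gradient rule rather than a deep network. The point it shows is that the reward is computed only once the string is complete.

```python
import math
import random

TOKENS = ["C", "N", "O"]   # toy alphabet (assumption), not real SMILES grammar
LENGTH = 6                 # fixed string length, chosen for simplicity


def toy_property_score(string):
    """Hypothetical stand-in for a QSAR predictor: favours strings rich in 'N'."""
    return string.count("N") / len(string)


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_string(logits, rng):
    """Build a string token by token; return it with the chosen token indices."""
    choices = []
    for t in range(LENGTH):
        probs = softmax(logits[t])
        choices.append(rng.choices(range(len(TOKENS)), weights=probs)[0])
    return "".join(TOKENS[i] for i in choices), choices


def train(episodes=3000, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [[0.0] * len(TOKENS) for _ in range(LENGTH)]  # one policy row per position
    baseline = 0.0
    for _ in range(episodes):
        string, choices = sample_string(logits, rng)
        reward = toy_property_score(string)       # reward only for the finished string
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)    # running-average baseline
        for t, a in enumerate(choices):
            probs = softmax(logits[t])
            for k in range(len(TOKENS)):
                # gradient of log pi(a) with respect to logit k
                grad = (1.0 if k == a else 0.0) - probs[k]
                logits[t][k] += lr * advantage * grad
    return logits


def greedy_string(logits):
    """Pick the highest-logit token at every position."""
    return "".join(
        TOKENS[max(range(len(TOKENS)), key=lambda k: row[k])] for row in logits
    )


trained_logits = train()
best = greedy_string(trained_logits)
print(best, toy_property_score(best))
```

In a realistic setting the logit table would be replaced by a sequence model and the scoring function by a trained property predictor; the update rule keeps the same shape.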

Goals and advances

Application of ML methods to problems in chemistry, biology, materials, etc., has taken a giant leap during the last few years.[68] This section presents selected popular fields that have witnessed immense progress through ML.

Molecular property prediction

Since the emergence of atomistic theory, chemists have strived to predict the properties of molecular systems without actually synthesizing them. Molecular property prediction has applications in many fields like quantum mechanics, physical chemistry, biophysics, and physiology.[10,69,70] The molecular properties range from solubility (angstroms) to protein-ligand binding (nanometers) to in vivo toxicity (meters). Recently, it has attracted much attention since it accelerates the discovery of substances with desired characteristics, such as drugs designed for a specific target.[71-75] Molecular properties like the total energy of a system are most accurately calculated by QM or Density Functional Theory (DFT) methods, but the process is computationally too expensive for an exhaustive exploration of the CCS.[76] The Schrödinger equation (SE) helps us find the electron density for simple systems of small size, but solving it for complex many-body systems is almost impossible. DFT methods, which are computational modelling methods derived or approximated from the SE, are impractical for large systems because their computational cost typically scales as $O(N^3)$, where N is the number of atoms. For modelling such systems, methods like those involving MM force fields are adopted. Essentially, force fields provide the potential energy of a molecule as a function of nuclear positions.[77] However, these methods improve speed by compromising accuracy. ML methods are replacing traditional calculations at an increasing rate since they can predict properties with DFT-level accuracy at speeds comparable to MM. These ML methods aim to learn a function that maps a molecule to the property of choice.
In the last year alone, a notable number of scientific papers have appeared on ML applications in the prediction of molecular properties.[78-82] There are three main steps in learning QSPRs: generating a training set with measured properties, preparing suitable molecular descriptors or inputs, and building an ML architecture to predict the measured properties from the inputs. Early studies applying ML to QSPR tasks employed linear regression models, which were quickly surpassed by Bayesian neural networks and other approaches.[83,84] In 2012, von Lilienfeld and co-workers proposed an ML method based on non-linear statistical regression to predict the atomization energies of organic molecules. The supervised learning method used a subset of 7000 stable organic compounds from GDB. Their Cartesian coordinates and nuclear charges were encoded into a CM as inputs, without any explicit feature engineering. With a training set of only 1000 compounds, the model achieved a mean absolute error (MAE) of 14.9 kcal/mol. This extraordinary result showed that an ML method could predict QM properties with reasonable accuracy without having to solve the SE explicitly. Over the years, various traditional ML methods have been employed.[85] These methods generally rely on rule-based feature engineering. ANNs are popular among recent state-of-the-art publications.[86] DL models are capable of automatic feature learning and are widely employed for prediction.[57,87,88] Laghuvarapu et al.[28] developed the BAND neural network, a DL framework for atomization energy prediction and geometry optimization of small organic molecules. The model was remarkably accurate and robust over the conformational, configurational, and reaction space. It also performed reasonably well on molecules larger than the ones in its training set. Most studies are on organic molecules. Inorganic molecules, especially clusters, need to be studied more.
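As an illustration of how Cartesian coordinates and nuclear charges can be encoded into a CM (Coulomb matrix), the sketch below uses the standard definition: diagonal entries $0.5\,Z_i^{2.4}$ and off-diagonal entries $Z_i Z_j / |R_i - R_j|$. The function name and the example geometry are assumptions made here, and coordinates are taken to be in atomic units (bohr).

```python
import math


def coulomb_matrix(charges, coords):
    """Return the Coulomb matrix as a list of rows.

    charges: nuclear charges Z_i
    coords:  Cartesian positions R_i, assumed to be in bohr
    """
    n = len(charges)
    cm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # diagonal term: 0.5 * Z_i ** 2.4
                cm[i][j] = 0.5 * charges[i] ** 2.4
            else:
                # off-diagonal term: Coulomb repulsion between nuclei i and j
                cm[i][j] = charges[i] * charges[j] / math.dist(coords[i], coords[j])
    return cm


# Example: an H2 molecule with an illustrative bond length of 1.4 bohr.
h2 = coulomb_matrix([1, 1], [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0)])
for row in h2:
    print(row)
```

The raw matrix depends on the order in which atoms are listed, so in practice it is usually post-processed (for example by sorting, or by taking its eigenvalues) before being passed to a regression model.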
Modee et al.[89] introduced the Deep Learning Enabled Topological (DART) model, which uses the Topological Atomic Descriptor (TAD) as a feature vector for energy prediction of metal clusters. Although DL has been successful in property prediction, it is still in its infancy.[69,90,91] In 2017, Goh et al. proposed ChemNet for prediction by using 2D RGB images of molecular diagrams as inputs.[88] Grid-like transformations like these usually cause loss of molecular information lying in non-Euclidean space, so that the molecule’s internal spatial and distance information is not complete.[92] Geometric DL encompasses the emerging techniques that aim to generalize DNNs to non-Euclidean domains, such as graphs and manifolds.[92] Graph neural networks (GNNs) have achieved superior performance in various domains and have shown great potential for molecular property prediction, as they can directly handle non-Euclidean data.[78,82,91,93-96] Variants of GNNs like Message Passing Neural Networks (MPNNs),[93] SchNet[94] and Multiscale Graph Convolutional Networks (MGCNs)[71] use a graph representation of molecules for prediction. They have several neural layers that project each node of the graph into a latent space with a low-dimensional embedding. The node embeddings (interaction messages) are propagated and updated iteratively using the embeddings of their neighborhood. This is called message passing. The node embeddings are then pooled for property prediction. Pathak and others[95,96] developed a GNN-based solution that accurately predicts solvation free energies and is interpretable. The first phase of the model utilized an MPNN to compute inter-atomic interactions within both solute and solvent molecules expressed as molecular graphs. Though GNNs are successful, they are generally data-hungry. Labeled molecules usually span a small portion of the CCS since they can only be generated by expensive and time-consuming techniques.
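The message-passing and pooling steps described above can be sketched in a few lines. This is a generic illustration, not the architecture of MPNN, SchNet, or MGCN: the weight matrices are fixed, untrained numbers chosen arbitrarily, the node features are one-hot element labels, and sum pooling followed by a linear readout is assumed.

```python
import math


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def message_passing_step(h, neighbors, W_self, W_nbr):
    """Update every node embedding from its own state and the sum of its neighbours."""
    new_h = []
    for i, h_i in enumerate(h):
        msg = [0.0] * len(h_i)
        for j in neighbors[i]:
            msg = [m + x for m, x in zip(msg, h[j])]   # aggregate neighbour embeddings
        own = matvec(W_self, h_i)
        nbr = matvec(W_nbr, msg)
        new_h.append([math.tanh(a + b) for a, b in zip(own, nbr)])
    return new_h


def predict(h, neighbors, W_self, W_nbr, w_out, steps=2):
    """Run a few message-passing steps, sum-pool the nodes, apply a linear readout."""
    for _ in range(steps):
        h = message_passing_step(h, neighbors, W_self, W_nbr)
    pooled = [sum(col) for col in zip(*h)]
    return sum(w * p for w, p in zip(w_out, pooled))


# Hypothetical, untrained parameters.
W_SELF = [[0.5, -0.2], [0.1, 0.3]]
W_NBR = [[0.2, 0.1], [-0.3, 0.4]]
W_OUT = [1.0, -0.5]

# Heavy-atom graph C-C-O with one-hot features [is_C, is_O].
features_a = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
neighbors_a = [[1], [0, 2], [1]]

# The same graph with the atoms listed in the order O, C, C.
features_b = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
neighbors_b = [[1], [0, 2], [1]]

y_a = predict(features_a, neighbors_a, W_SELF, W_NBR, W_OUT)
y_b = predict(features_b, neighbors_b, W_SELF, W_NBR, W_OUT)
print(y_a, y_b)
```

Because both the neighbour aggregation and the pooling are sums, the prediction does not depend on how the atoms are numbered, which is one reason graph representations suit molecules.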
Other unlabelled valid molecules may also carry useful structural information. Methods like unsupervised, semi-supervised, and self-supervised learning provide effective solutions to incorporate these unlabelled molecules.[79-81] Property prediction ML models have achieved high scalability and high prediction quality across both chemical and conformational space. Due to this, they are also employed in various MD simulation tasks, such as analyzing MD trajectories and enhancing sampling.[97,99,100] As explained above, ML has shown extraordinary potential in accurate predictions of quantum mechanical properties such as electronic energies. These efforts have been accomplished by using supervised learning based on a large amount of pre-computed data. The availability of such data has made it possible to circumvent the need to solve the Schrödinger equation explicitly. While an analytical solution is elusive for multi-electron systems, accurate numerical solutions using configuration interaction and coupled-cluster methods are computationally prohibitive. In practice, a trade-off between computational efficiency (expense) and accuracy is made when choosing an appropriate wavefunction approximation. ANNs are universal function approximators, and a few studies have explored their application for obtaining an ab initio solution of the many-electron Schrödinger equation. Carleo and Troyer proposed representing the wavefunction with neural networks that are trained in an unsupervised manner using the variational principle.[101] They showed high accuracy in describing the ground and excited states of interacting spin models in up to two dimensions, demonstrating the possibility of applying ANNs to solve quantum many-body systems. Han et al. used deep NNs as trial wavefunctions and used the variational Monte Carlo method to obtain the optimal wavefunction (DeepWF).[102] Pfau et al. introduced the Fermionic neural network (FermiNet), which obeys Fermi-Dirac statistics.
They showed quantitative accuracy in calculating the dissociation curves of the nitrogen molecule and H10.[103] More recently, in a seminal paper, Hermann et al. reported a deep NN representation of the electronic wavefunction named PauliNet. They demonstrated that this method outperforms traditional variational methods on systems of up to 30 electrons.[104] With these approaches, the curse of limited basis sets, a major source of inaccuracies in computational quantum mechanical methods, is overcome. The application of ANNs to solving many-body quantum systems has just begun, and research in this direction opens up exciting opportunities for modeling chemical systems efficiently and accurately.
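The variational Monte Carlo idea behind these neural-network wavefunctions can be shown on a textbook problem. The sketch below is an assumption-laden toy, not a reproduction of DeepWF, FermiNet, or PauliNet: it treats the one-dimensional harmonic oscillator (with $\hbar = m = \omega = 1$) and uses a single-parameter Gaussian trial wavefunction $\psi_a(x) = e^{-a x^2}$ in place of a neural network. For this trial function the local energy is $E_L(x) = a + x^2 (1/2 - 2a^2)$, and the variational principle guarantees that the sampled energy is never below the exact ground-state value of 0.5.

```python
import math
import random


def local_energy(x, a):
    """Local energy (H psi)/psi for psi(x) = exp(-a x^2), with H = -1/2 d2/dx2 + x^2/2."""
    return a + x * x * (0.5 - 2.0 * a * a)


def vmc_energy(a, n_steps=20000, step=1.0, seed=0):
    """Estimate the energy of psi_a by Metropolis sampling of |psi_a|^2."""
    rng = random.Random(seed)
    x = 0.0
    total = 0.0
    for _ in range(n_steps):
        x_new = x + rng.uniform(-step, step)
        # acceptance ratio |psi(x_new)|^2 / |psi(x)|^2
        if rng.random() < math.exp(-2.0 * a * (x_new * x_new - x * x)):
            x = x_new
        total += local_energy(x, a)
    return total / n_steps


# Scan the variational parameter; the lowest energy identifies the best trial function.
for a in (0.3, 0.4, 0.5, 0.6, 0.7):
    print(a, vmc_energy(a))
```

At $a = 0.5$ the trial function is the exact ground state, so the local energy is constant and the estimate has zero variance. Neural-network ansatzes follow the same recipe with many more parameters, optimized by gradient descent on the sampled energy instead of a scan.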

Molecular dynamics simulations

With advances in algorithms and the power of computing resources, MD simulations have become an integral tool for analyzing molecular systems.[10,105] They have helped us analyze thermodynamic and dynamic properties of molecules, create 4D molecular descriptors, probe complex processes such as protein folding, and serve many other purposes.[106,107] MD is a computer simulation approach for analyzing the time evolution of an interacting molecular system.[108,109] The motion of the system (atomic trajectories) is generated by solving the classical Newtonian equations of motion for a specific interatomic potential, subject to the initial and boundary conditions.[110,111] The predictive power of the simulations depends on the underlying potential energy surface (PES).[112,113] Hence, they require a precise PES U(x), which is a function of atomic coordinates x. Molecular modeling techniques are mostly based on either QM methods (e.g., DFT) or on force fields (e.g., Stillinger-Weber potentials). The two techniques stand on opposite sides of the cost-accuracy trade-off, and the approximations to U(x) lack transferability. Studies have shown that ML methods are capable of creating interatomic potentials that surpass conventional methods in terms of both accuracy and versatility. As mentioned earlier, they are much faster than QM methods and have comparable accuracy. In 2007, Behler & Parrinello[73] proposed an ANN solution to represent the PES. They achieved transferability through parameter sharing and the summation principle, meaning the network could adjust to molecules of any size. Since then, other ML PES models have emerged, like Deep Potential net and ANI networks. Most ML PES models are based on nonlinear kernel learning or ANNs, each having its own advantages.[99] For elemental solids, Gaussian approximation potentials (GAP)[114,115] are nowadays used in MD simulations.
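The summation principle and parameter sharing mentioned above can be made concrete with a small sketch. It is a simplified illustration in the spirit of that approach, not the published model: each atom gets a single radial descriptor with a smooth cosine cutoff, and one shared function with hypothetical, untrained parameters maps the descriptor to an atomic energy. The total energy is the sum over atoms, so the same parameters apply to systems of any size.

```python
import math


def cutoff(r, r_c):
    """Smooth cosine cutoff that goes to zero at r_c."""
    return 0.5 * (math.cos(math.pi * r / r_c) + 1.0) if r < r_c else 0.0


def descriptor(i, coords, eta=0.5, r_c=6.0):
    """Radial descriptor of the local environment of atom i (parameters are assumptions)."""
    g = 0.0
    for j, r_j in enumerate(coords):
        if j != i:
            r = math.dist(coords[i], r_j)
            g += math.exp(-eta * r * r) * cutoff(r, r_c)
    return g


def atomic_energy(g, w1=1.3, b1=-0.4, w2=-0.8):
    """Shared one-neuron 'network' mapping a descriptor to an atomic energy (untrained)."""
    return w2 * math.tanh(w1 * g + b1)


def total_energy(coords):
    """Summation principle: the total energy is the sum of atomic contributions."""
    return sum(atomic_energy(descriptor(i, coords)) for i in range(len(coords)))


dimer = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
# Two copies of the dimer placed far beyond the cutoff radius.
two_dimers = dimer + [(100.0, 0.0, 0.0), (101.0, 0.0, 0.0)]

print(total_energy(dimer), total_energy(two_dimers))
```

Because every atom uses the same function and atoms beyond the cutoff do not interact, two well-separated dimers give twice the dimer energy, and reordering the atoms leaves the energy unchanged.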
They provide insights into various domains, for example, amorphous states of matter.[116] Pattnaik et al.[117] used data obtained with DFT on small systems to simulate large systems, taking liquid argon as a test case. ML models have been shown to have the potential to mimic MD trajectories produced through simulations.[118-120] Tsai et al.[120] used LSTMs to learn the evolution of MD trajectories that were mapped into a sequence of characters in some language. In addition to force fields, ML has been used to design molecular models at resolutions coarser than atomistic models, as atomistic models are computationally expensive to simulate. For example, CGnets can be used to coarse grain away all the solvent molecules around a protein and map the atoms of each residue to the corresponding Cα atom. ML has made a variety of contributions to the analysis and simulation of MD trajectories.[98,99] For instance, it has enabled the estimation of free energy surfaces. Along with enhanced sampling methods, it has also been used to learn the free energy surface on the fly. Studies have also employed ML in building Markov state models and dynamic graphical models of molecular kinetics. For example, VAMPnets was developed as a substitute for the complex and error-prone technique of constructing Markov state models. Other contributions of ML in this domain include ML-driven definition of optimal reaction coordinates, enhancement of sampling through learning bias potentials, and selection of starting configurations through active learning. In the field of molecular design, ML can quickly explore vast regions of the CCS to generate molecules with desired properties, avoiding MD simulations altogether. The next section presents this idea.

Inverse molecular design

Molecular design algorithms aim to virtually create and analyze molecules with relevant optimized properties such as synthetic accessibility and ADMET (absorption, distribution, metabolism, elimination, and toxicity) profile.[121,122] Finding new chemical compounds for drug discovery can be portrayed using the metaphor “finding a needle in a haystack” (Schneider et al., 2019). In this case, the haystack is the universe of synthetically feasible molecules in the CCS, wherein a single molecule with various desired properties is searched for. Clever navigation is required to explore vast chemical spaces efficiently. Forward strategies for molecular design lead from the CCS to the properties using experiments, simulations, gradient-based algorithms, Monte Carlo or genetic algorithms, or combinations thereof. This means that the input is the molecular structure, and the output is the properties of the molecules. These direct methods have been successful in their application domains; however, they are unable to quickly cover relevant large chemical spaces.[123] Inverse molecular design has emerged as an attractive approach to take on these challenges.[58,124] As its name suggests, it inverts the direct approach by taking the desired properties as input and identifying an optimized molecular structure as output. The approach need not identify one unique structure; it may instead yield a distribution of probable structures. Valid molecules with similar functionalities lie nearby on a continuous curve or manifold. Inverse design uses optimization, sampling, and search methods to navigate the functionality manifold of the CCS.[125] One of the earliest attempts at inverse design was high-throughput virtual screening (HTVS). HTVS is performed to ascertain an initial set of candidate molecules, called “hits”. In HTVS, molecules from large small-molecule drug libraries are evaluated for properties, such as binding affinity, against a target receptor.
More recent techniques involving optimization can be roughly divided into two types: evolutionary techniques and ML algorithms.[58] Recently, Mehta et al.[126] proposed an ML framework, “MEMES”, based on Bayesian optimization for efficient sampling of chemical space. The architecture identifies 90% of the top-1000 molecules from a dataset of about 100 million molecules, while calculating the docking score for only about 6% of the dataset. Recent ML-driven methods have accelerated the search for new molecules with desired properties. Generative models such as VAEs,[57,127] RNNs,[128,129] GANs[130] and Generative Pre-Training (GPT)[131] can model complex SPRs and use them to create molecular designs. Pathak et al.[59] proposed a deep learning based inorganic material generator (DING) framework that employs conditional variational autoencoders (CVAE) as a generator and DNNs as predictors of enthalpy of formation, volume per atom and energy per atom. Bagal et al.[131] trained a GPT model, named MolGPT, to predict a sequence of SMILES tokens for molecular generation. The model can be trained conditionally to optimize multiple properties of the generated molecules, including scaffold conditioning. However, these models require large amounts of training data to learn valid molecular distributions. In RL, an agent builds new molecules in a stepwise fashion.[64-66] Training an RL agent only requires samples from a reward function, so the need for training data is reduced. The generative process must be restricted or biased towards desirable qualities, as mentioned earlier in the “AEs” section. In VAEs, the latent space allows direct gradient-based optimization of desired properties, as it is continuous. Nevertheless, the functionality manifold has local minima.
Bayesian optimization or constrained optimization, with Gaussian processes, is applied to explore a smoothed version of the manifold.[58] In the case of GANs and RNNs dealing with non-continuous data, a gradient estimator is required to backpropagate through the generator. RL has been employed as an approach to bias the generation process by rewarding the generator’s behaviors. Some examples are methods involving Q-learning and policy gradients (SeqGANs and BGANs). Several studies have adopted RL for the generation of drug-like molecules. Popova et al. proposed Reinforcement Learning for Structural Evolution (ReLeaSE), a de novo molecular design method.[132] Molecular applications have also adopted models that combine generative algorithms to utilize the advantages of each. For example, druGAN[133] adopts an adversarial autoencoder network, while RANC[134] adopts both RL and an adversarial network. A few promising research directions in this domain include structured architectures such as multilevel VAEs and inverse RL. Developments in inverse RL may allow for the discovery of reward functions associated with different molecular design tasks.[58]
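The gradient-based optimization in a continuous latent space described above for VAEs, and the local-optimum problem that comes with it, can be illustrated as follows. Everything here is a toy assumption: the latent space is two-dimensional, `predicted_property` is a hypothetical smooth predictor with one high peak and one lower peak, and no decoder back to molecules is modelled.

```python
import math

# (centre, height) of each peak in a hypothetical property landscape over latent space.
PEAKS = [((2.0, 0.0), 1.0), ((-2.0, 0.0), 0.6)]


def predicted_property(z):
    """Toy differentiable property predictor defined on latent vectors z = (z0, z1)."""
    return sum(
        h * math.exp(-((z[0] - c[0]) ** 2 + (z[1] - c[1]) ** 2)) for c, h in PEAKS
    )


def gradient(z):
    """Analytic gradient of predicted_property with respect to z."""
    gx = gy = 0.0
    for c, h in PEAKS:
        g = h * math.exp(-((z[0] - c[0]) ** 2 + (z[1] - c[1]) ** 2))
        gx += -2.0 * (z[0] - c[0]) * g
        gy += -2.0 * (z[1] - c[1]) * g
    return gx, gy


def optimize(z, lr=0.1, steps=500):
    """Plain gradient ascent on the predicted property, starting from latent point z."""
    for _ in range(steps):
        gx, gy = gradient(z)
        z = (z[0] + lr * gx, z[1] + lr * gy)
    return z


z_a = optimize((1.0, 0.5))     # starts in the basin of the higher peak
z_b = optimize((-1.0, 0.5))    # starts in the basin of the lower peak
print(z_a, predicted_property(z_a))
print(z_b, predicted_property(z_b))
```

The second run stalls at the lower peak, which is the kind of local optimum that motivates the use of Bayesian optimization or multiple restarts on such landscapes.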

Materials discovery and design

New materials can contribute to immense progress in tools and technology.[135,136] Materials discovery and design aim to find candidate materials with desired properties that are synthesizable.[137] This would allow experimental researchers to perform targeted explorations. Materials screening via traditional experiments or computational simulations involves element replacement and structure transformation.[135] The chemical compositional and structural search space tends to be constrained in these methods.[135,138] ML is employed for finding solutions to various problems in materials science as it has led to a decrease in materials development time and cost.[135,136,139-143] There are now many examples, such as thermoelectrics and photovoltaic materials,[144] metal organic frameworks (MOFs),[145] metallic glass,[146] polymers,[147] and DNA nanostructures,[148] in which ML has been applied to move away from the traditional methods. ML has performed well in areas such as materials property prediction,[149-151] novel materials discovery,[59,152-155] process optimization,[156,157] finding density functionals,[158] and other materials-related studies.[135,159,160] Finding new chemical components and their crystal structures that are likely to match the composition and properties of desired materials is an essential step in novel materials discovery.[136] ML is used to learn and screen for potential combinations of chemical components and structures from a large dataset containing real and synthesized materials. Then, the most probable crystal structures need to be identified and tested for stability. The number of candidate compounds is still huge because of the extremely large combination space of compositions and structures.[137] Therefore, these candidate new compounds still need to be tested by first-principles calculations (e.g., DFT).
Hautier et al.[161] demonstrated how the search for novel materials can be accelerated using a combination of ML techniques and high-throughput ab initio computations. Methods involving VAEs have recently been applied to solid-state materials[154] and porous materials.[162] GANs are finding their place in materials design too. A recent application is ZeoGAN,[155] which is employed in the generation of an energy grid of guest molecules and zeolite structures. RL has been effective for exploring chemical space for different applications, such as MOFs for gas adsorption, and synthesis planning. Dieb et al.[163] used RL to design depth-graded multilayer structures, known as supermirrors, for X-ray optics applications. Active learning approaches are also gaining attention in the field. They allow the exploration of new regions of space that were not in the initial dataset.[142,164] This is done by adding new data points to the training set on the fly based on model uncertainty.
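The active learning loop described above can be sketched in a few lines. This is a schematic illustration under strong assumptions: candidates are points on a one-dimensional composition axis, `expensive_oracle` is a hypothetical stand-in for an experiment or a first-principles calculation, and the distance to the nearest labelled point is used as a simple proxy for model uncertainty in place of a real predictive variance.

```python
def expensive_oracle(x):
    """Hypothetical stand-in for an experiment or first-principles calculation."""
    return (x - 0.3) ** 2


def uncertainty(x, labelled):
    """Distance to the nearest labelled point, used here as an uncertainty proxy."""
    return min(abs(x - x_l) for x_l in labelled)


def active_learning(pool, initial, rounds):
    """Grow the training set by repeatedly labelling the most uncertain candidate."""
    labelled = {x: expensive_oracle(x) for x in initial}
    for _ in range(rounds):
        candidates = [x for x in pool if x not in labelled]
        query = max(candidates, key=lambda x: uncertainty(x, labelled))
        labelled[query] = expensive_oracle(query)   # add the new data point on the fly
    return labelled


pool = [i / 20 for i in range(21)]          # candidate compositions between 0 and 1
labelled = active_learning(pool, initial=[0.0, 0.05], rounds=4)
print(sorted(labelled))
```

Starting from two labelled points clustered at one end, the loop first queries the far end of the pool and then fills the largest remaining gaps, which is how such schemes reach regions absent from the initial dataset. With a probabilistic model such as a Gaussian process or an ensemble, the proxy would be replaced by the model's own predictive uncertainty.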

Other domains

ML has played a role in several other problems, such as protein-protein interactions, viable retrosynthetic pathways, stability of solids, etc. ML-based scoring functions have been shown to perform significantly better than software like AutoDock Vina for predicting both binding poses and affinities.[165] Finding functionally relevant binding sites on the 3D structure of a protein is crucial for drug design. Aggarwal et al.[166] proposed DeepPocket, a method that combines geometry-based software with DL and utilises 3D CNNs to make this process accurate. Results from ML methods in molecular sciences have been applied for many practical purposes. For example, many results of generative models have been used in pharmaceutics.[167] They aid in drug design by generating molecular systems and optimizing relevant medicinal properties such as solubility in water, ADMET profile and synthesizability. Healthcare systems also employ ML to analyse various health-related issues and accelerate decision-making processes efficiently.[168,169] To illustrate, the COVID-19 pandemic has seen numerous ML methods, such as those by Alle et al.[170] and Karthikeyan et al.,[171] who have provided risk stratification and mortality prediction models for patients with COVID-19. Another area of rapid development is imaging and -omics technologies, which will further blur the barrier between cheminformatics and bioinformatics.[172,173] Thus, molecular biology, transcriptomics, proteomics, etc. are becoming more relevant for ML researchers in molecular sciences.[166,174]

Challenges and outlook

Apart from successfully performing desired tasks, ML methods also provide novel insights and transformational ideas. For instance, analysing the weights of trained ML prediction models can potentially lead to automatic discovery of scientific laws and principles, which can cause a revolutionary development in science.[143] Another impressive example is from ML for molecular discovery, where the corresponding statistical view and analysis of the discovered chemical space leads to fresh insights, discoveries of molecules with unexpected properties, hints for new chemical reaction mechanisms, and more. However, current successful applications of ML in molecular sciences have only scratched the surface of possibilities.[100] One of the challenges is encoding the essential characteristics of a molecule into its numerical representation. This is one of the most effective ways to infuse physics into ML and generalise better. Attempts have been made to define criteria for the development of molecular descriptors, but adhering to all the criteria is difficult. From the perspective of atomic interactions, current molecular representations describe local chemical interactions well, but completely miss long-range interactions like polarization and van der Waals dispersion. Moreover, capturing highly complex QM interactions like dispersion attraction and exchange repulsion, especially in large molecules (Kollman 1985), has been difficult. An important direction for future progress in studying large complex molecular systems would be incorporating intermolecular interaction theory, such as Hamiltonians for electronic interactions based on DFT, molecular orbital techniques, or the many-body dispersion method, into ML. Further research into the criteria and creation methods of molecular descriptors will be necessary.[46] Another challenge is the limited amount of labeled molecular data available compared to other domains.
This poses the inherent danger of ML models overfitting to benchmarks. Thus, progress needs to be made in reducing the cost of data generation. Due to the combinatorial scaling of the CCS, it’s also crucial to infuse physics and invariance information into ML and achieve robustness and accuracy using smaller datasets. A few of the promising methods in this context include employing smart sampling methods, identifying valuable data points for training, and employing recent techniques such as transfer learning, meta-learning, or active learning.[175,176] Recently, a Bayesian framework performed as well as humans on one-shot learning problems with limited data.[143] Applying ML in molecular sciences is a young domain. Hence, much of the infrastructure is still in its early stages or waiting to be developed. Drug discovery operates as a feedback loop, where the large number of molecules designed by generative models must be synthesized and validated experimentally to provide feedback for further decision making.[122] These experiments are slow and expensive. Although prediction models can be coupled with generative models to streamline this process, the synthetic tractability of these molecules remains a challenge.[177] Future efforts towards closing the loop need to consider incorporating AI/ML, intelligent systems, embedded systems and robotics into one framework.[58] This can lead to automated laboratories.[178] This rapidly growing field in computational science, supported by increasing computing power, data sharing and open-source tools, has the potential to solve many theoretical and practical challenges. Beyond these numerous unsolved challenges lies the “chemical discovery revolution!”[116]

95 in total

1. Fast and accurate modeling of molecular atomization energies with machine learning.

Authors: Matthias Rupp; Alexandre Tkatchenko; Klaus-Robert Müller; O Anatole von Lilienfeld
Journal: Phys Rev Lett Date: 2012-01-31 Impact factor: 9.161

Review 2. Nanomaterials. Programmable materials and the nature of the DNA bond.

Authors: Matthew R Jones; Nadrian C Seeman; Chad A Mirkin
Journal: Science Date: 2015-02-20 Impact factor: 47.728

3. Learning Atomic Interactions through Solvation Free Energy Prediction Using Graph Neural Networks.

Authors: Yashaswi Pathak; Sarvesh Mehta; U Deva Priyakumar
Journal: J Chem Inf Model Date: 2021-02-05 Impact factor: 4.956

4. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17.

Authors: Lars Ruddigkeit; Ruud van Deursen; Lorenz C Blum; Jean-Louis Reymond
Journal: J Chem Inf Model Date: 2012-11-01 Impact factor: 4.956

5. Reinforced Adversarial Neural Computer for de Novo Molecular Design.

Authors: Evgeny Putin; Arip Asadulaev; Yan Ivanenkov; Vladimir Aladinskiy; Benjamin Sanchez-Lengeling; Alán Aspuru-Guzik; Alex Zhavoronkov
Journal: J Chem Inf Model Date: 2018-06-12 Impact factor: 4.956

6. NMRShiftDB -- compound identification and structure elucidation support through a free community-built web database.

Authors: Christoph Steinbeck; Stefan Kuhn
Journal: Phytochemistry Date: 2004-10 Impact factor: 4.072

7. Deep-neural-network solution of the electronic Schrödinger equation.

Authors: Jan Hermann; Zeno Schätzle; Frank Noé
Journal: Nat Chem Date: 2020-09-23 Impact factor: 24.427

8. Machine Learning Predictions of Molecular Properties: Accurate Many-Body Potentials and Nonlocality in Chemical Space.

Authors: Katja Hansen; Franziska Biegler; Raghunathan Ramakrishnan; Wiktor Pronobis; O Anatole von Lilienfeld; Klaus-Robert Müller; Alexandre Tkatchenko
Journal: J Phys Chem Lett Date: 2015-06-18 Impact factor: 6.475

9. Optimization of Molecules via Deep Reinforcement Learning.

Authors: Zhenpeng Zhou; Steven Kearnes; Li Li; Richard N Zare; Patrick Riley
Journal: Sci Rep Date: 2019-07-24 Impact factor: 4.379

10. PubChem in 2021: new data content and improved web interfaces.

Authors: Sunghwan Kim; Jie Chen; Tiejun Cheng; Asta Gindulyte; Jia He; Siqian He; Qingliang Li; Benjamin A Shoemaker; Paul A Thiessen; Bo Yu; Leonid Zaslavsky; Jian Zhang; Evan E Bolton
Journal: Nucleic Acids Res Date: 2021-01-08 Impact factor: 16.971
