
Uncertainty quantification: Can we trust artificial intelligence in drug discovery?

Jie Yu^1,2, Dingyan Wang^1,2, Mingyue Zheng^1,2.

Abstract

The problem of human trust is one of the most fundamental problems in applied artificial intelligence in drug discovery. In silico models have been widely used to accelerate the process of drug discovery in recent years. However, most of these models can only give reliable predictions within a limited chemical space that the training set covers (applicability domain). Predictions of samples falling outside the applicability domain are unreliable and sometimes dangerous for the drug-design decision-making process. Uncertainty quantification accordingly has drawn great attention to enable autonomous drug designing. By quantifying the confidence level of model predictions, the reliability of the predictions can be quantitatively represented to assist researchers in their molecular reasoning and experimental design. Here we summarize the state-of-the-art approaches to uncertainty quantification and underline how they can be used for drug design and discovery projects. Furthermore, we also outline four representative application scenarios of uncertainty quantification in drug discovery.

Entities: Chemical

Keywords: Applied computing; Artificial intelligence; Drugs

Year: 2022 PMID: 35996575 PMCID: PMC9391523 DOI: 10.1016/j.isci.2022.104814

Source DB: PubMed Journal: iScience ISSN: 2589-0042

Introduction

Artificial intelligence (AI) and other data-driven approaches are reshaping drug discovery and design processes. For tasks with large amounts of training data, supervised learning can effectively map the relationship between inputs and outputs. A typical scenario is predicting protein structure from the primary sequence, where AlphaFold2 (Jumper et al., 2021) is believed to have solved this half-century-old problem (Buel and Walters, 2022). However, in most drug design tasks, the amount of available training data is limited (Altae-Tran et al., 2017). A mismatch between the distributions of training data and test data may cause the model to produce unreliable outputs, which can have adverse consequences for decision-making in drug design (Begoli et al., 2019). Unfortunately, classical deep learning (DL) models do not provide confidence estimates for their outputs. For regression tasks, the output is a single deterministic value without any uncertainty measurement. For classification tasks, the output is a probability distribution, which can be taken as the prediction confidence to some extent but is often poorly calibrated (Mervin et al., 2020).

To illustrate this, we built a toy dataset in which x is a real number ranging from 0 to 20 and y is a binarized label indicating whether a quantity computed from x is larger than 0.5 (y = 1) or not (y = 0). As shown in Figure 1A, the toy dataset is split into a training part (x < 12) and a test part (x ≥ 12). A neural network with two hidden layers and a Softmax output layer was trained on the training set. Figure 1B shows the probabilities given by the model on the training set and the test set. The model fits the training part well but gives overconfident false predictions on the test part. The probability given by the Softmax function alone therefore cannot reliably be taken as the confidence of the prediction. Thus, novel uncertainty quantification (UQ) strategies that are more effective, well calibrated, and compatible with different neural network architectures are in high demand (Mervin et al., 2021a).

Figure 1

The probability given by the Softmax function cannot reliably be taken as the confidence of the prediction

(A) A toy dataset is built for illustration, in which x is a real number ranging from 0 to 20 and y is a binarized label indicating whether a quantity computed from x is larger than 0.5 (y = 1) or not (y = 0). The dataset is split into a training part (x < 12) and a test part (x ≥ 12). A neural network with two hidden layers and a Softmax output layer was trained on the training set.

(B) The probability given by the model on the training set and the test set. The model fits the training part well but gives overconfident false predictions on the test part.

Evaluating the quality of a UQ method is difficult because the application scenario and the objectives of users must be taken into account, but in general the two aspects of greatest concern are the ranking ability and the calibration ability of the method. Ranking ability characterizes the correlation between uncertainty and error: a UQ method with ideal ranking ability assigns higher uncertainty values to predictions with larger errors. For regression tasks, a suitable correlation coefficient (e.g., the Spearman correlation coefficient) can be used to quantify the correlation between prediction error and uncertainty. For classification tasks, wrongly predicted samples are expected to be prioritized by uncertainty. Specifically, incorrectly and correctly classified samples can be regarded as positives and negatives, respectively, and the ranking ability of a UQ method can then be quantified by auROC (area under the receiver operating characteristic curve) or auPRC (area under the precision-recall curve). Calibration ability characterizes how well the uncertainty indicates the error distribution. For example, under the regression setting, a UQ model is expected to estimate the variance of the error distribution precisely, which is important for confidence interval estimation.

The chemistry community has long used concepts similar to uncertainty quantification, the most common of which is the applicability domain (AD) (Sheridan, 2012, 2013, 2015) of quantitative structure-activity relationship (QSAR) models. To avoid confusion, we specify the relationship between the two here. UQ and AD share the same purpose: to help researchers determine whether the prediction for a sample is reliable. Predictions for compounds outside the applicability domain are considered less reliable (corresponding to higher uncertainty), and vice versa. Thus, UQ and AD are closely linked. Compared with UQ, traditional AD definition methods are more input-oriented: they generally consider the feature space or a feature subspace of the samples and pay less attention to the structure of the model itself. The concept of UQ is broader and refers in general to all methods used to determine whether a prediction is reliable. As a result, AD definition methods are conceptually covered by UQ. Here, some classical AD definition methods are classified as similarity-based UQ methods and are introduced in the “similarity-based approaches” section.

In this article, we review the concept, methods, and applications of UQ in the current drug design and discovery paradigm. We do not thoroughly cover UQ strategies outside the context of drug design, especially because the review by Abdar et al. (2021) has already done so. Instead, we pay more attention to specific application cases of UQ and explain the underlying principles of the methods used, and we hope this review will give insights and practical guidance for deploying trustworthy AI models in drug design.
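The ranking criteria described in this introduction can be made concrete with a small sketch. The code below is an illustration under stated assumptions, not a procedure taken from the cited works: the helper names (`ranks`, `spearman`, `auroc`) and all numbers are hypothetical, and only the Python standard library is used. It computes the Spearman correlation between absolute error and uncertainty for a regression case, and the auROC obtained when misclassified samples are treated as positives and ranked by uncertainty.

```python
import math

def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(a, b):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = math.sqrt(sum((x - ma) ** 2 for x in ra))
    sb = math.sqrt(sum((y - mb) ** 2 for y in rb))
    return cov / (sa * sb)

def auroc(scores, is_positive):
    """Probability that a random positive scores above a random negative."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Regression: does higher uncertainty go with larger absolute error?
abs_errors = [0.1, 0.4, 0.2, 1.5, 0.9]     # hypothetical values
uncertainty = [0.2, 0.5, 0.3, 1.1, 1.3]    # hypothetical values
rho = spearman(abs_errors, uncertainty)

# Classification: misclassified samples are the "positives" to be ranked first.
misclassified = [False, False, True, False, True]
cls_uncertainty = [0.05, 0.50, 0.60, 0.30, 0.45]
ranking_auroc = auroc(cls_uncertainty, misclassified)
```

A value of `rho` or `ranking_auroc` close to 1 indicates that uncertainty ranks errors well; neither number says anything about calibration, which has to be checked separately.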

Sources of uncertainty in drug discovery

According to its source, uncertainty can be broadly divided into three categories: approximation, aleatoric, and epistemic uncertainty (Kiureghian and Ditlevsen, 2009). Approximation uncertainty accounts for the errors caused by the inability of simplistic models to fit complex data, such as the error made by a linear model fitting a sinusoidal curve (Tagasovska and Lopez-Paz, 2019). However, because deep neural networks are known to be universal approximators, approximation uncertainty is usually assumed to be negligible. For more details, readers are referred to the study by Lazic and Williams (2021), which introduces the sources of uncertainty, including approximation uncertainty. In this section, we focus on aleatoric and epistemic uncertainties.

Aleatoric uncertainty

Aleatoric uncertainty (derived from the Latin alea, which means the rolling of dice) describes the intrinsic randomness (noise) of the data to be modeled (Tagasovska and Lopez-Paz, 2019). In Figure 2A, the fitted model is represented as a black solid line, and the observed data are represented as red points. As can be seen, the model assigns a lower aleatoric uncertainty to the data points that follow a regular pattern (low-noise data) and a higher aleatoric uncertainty to the data points that follow a random pattern (high-noise data). As an inherent attribute of the data, aleatoric uncertainty cannot be reduced by collecting more training data. In drug discovery projects, data noise typically arises from experimental measurements, which are affected by two main sources of error: systematic error and random error (Kolmar and Grulke, 2021). Hence, aleatoric uncertainty is often used to estimate whether the maximal performance of a model has been reached (i.e., when the model error approaches the experimental error) (Beker et al., 2020), which will be detailed in the “improving model accuracy and robustness” section.

Figure 2

Illustration of aleatoric uncertainty and epistemic uncertainty

The fitted model is represented as a black solid line and the observed data are represented as red points. The blue area shows the 95% confidence interval (uncertainty measurement).

(A) A probabilistic neural network is built to provide a confidence interval of the prediction. The model assigns a lower aleatoric uncertainty to the data points that follow a regular pattern (low-noise data) and a higher aleatoric uncertainty to the data points that follow a random pattern (high-noise data).

(B) A Gaussian process regression model is used to provide a confidence interval of the prediction. Predictions in regions with few or no observed data points are assigned higher epistemic uncertainty, whereas predictions in regions with observed data points are assigned lower epistemic uncertainty.


Epistemic uncertainty

Epistemic uncertainty (derived from the Greek episteme, which means “knowledge”) represents the errors associated with the trained model's lack of knowledge in certain regions of the sample space (e.g., the chemical space outside the AD of the model) (Tagasovska and Lopez-Paz, 2019). As shown in Figure 2B, predictions in regions with few or no observed data points are assigned higher epistemic uncertainty, whereas predictions in regions with observed data points are assigned lower epistemic uncertainty. Hence, unlike aleatoric uncertainty, epistemic uncertainty can be reduced by collecting data in those low-density regions. Samples with higher epistemic uncertainty are more informative for the model (e.g., they may reveal a novel structure-activity relationship). Therefore, epistemic uncertainty can be used to guide experiment design so that data are annotated at lower experimental cost while the model's performance gain is maximized (Ding et al., 2021). The corresponding application is referred to as active learning (AL), which will be detailed in the “active learning” section.

Methods of uncertainty quantification

A large number of UQ methods have been deployed in drug discovery projects. Here, we put forward a new taxonomy to track the development of these methods. By focusing on their theoretical foundations, we categorize them into three types: similarity-based, Bayesian, and ensemble-based approaches. For clarity, their core ideas, representative methods, and example applications are summarized in Table 1. These UQ methods and associated concepts are reviewed in the following sections.

Table 1

The summary of the uncertainty quantification methods

UQ methods	Core idea	Representative methods (a)	Example applications (a)
Similarity-based	If a test sample is too dissimilar to training samples, the corresponding prediction is likely to be unreliable.	1. Bounding Box (Netzeva et al., 2005) 2. Convex Hull (Jaworska et al., 2005) 3. DM (Sheridan et al., 2004) 4. SDC score (Liu et al., 2018) 5. NNAS (Allen et al., 2020)	1. Virtual screening (Berenger and Yamanishi, 2019) 2. Anticancer peptide activity prediction (Chen et al., 2021) 3. SARS-CoV-2 inhibitor prediction (Gawriljuk et al., 2021) 4. Toxicity prediction (Jiang et al., 2021)
Bayesian	Parameters and outputs are treated as random variables and maximum a posteriori (MAP) estimation is adopted according to Bayes’ theorem.	1. VI (MC-dropout) (Gal and Ghahramani, 2016) 2. BNN (Goan and Fookes, 2020) 3. GP-MGK (Xiang et al., 2021) 4. MVE (Nix and Weigend, 1994) 5. Bayesian GCN (Ryu et al., 2019)	1. Molecular property prediction (Zhang and Lee, 2019) 2. Virtual screening (Ryu et al., 2019) 3. Protein-ligand interaction prediction (Kim et al., 2021)
Ensemble-based	The consistency of the predictions from various base models is an estimate of confidence.	1. Bootstrapping (Scalia et al., 2020) 2. RF (Sheridan, 2012) 3. DeltaDelta (Jimenez-Luna et al., 2019) 4. Deep ensemble (Lakshminarayanan et al., 2017) 5. MC-dropout (Gal and Ghahramani, 2016)	1. Drug-likeness prediction (Beker et al., 2020) 2. Molecular property prediction (Scalia et al., 2020) 3. Lead optimization (Jimenez-Luna et al., 2019)

(a) The representative methods and example applications are not exhaustive.


Similarity-based approaches

Similarity-based approaches rest on the idea that if a test sample is too dissimilar to the training samples, the corresponding prediction is likely to be unreliable. In practice, users first choose or define a method to measure the distance between a test sample and the training samples, and this distance is then regarded as the estimated uncertainty of the prediction. Some of these approaches have been widely used to define the AD of QSAR models.

A simple similarity-based approach named Bounding Box defines a range of acceptable values for each descriptor based on the distribution of its values in the training set (Netzeva et al., 2005). For a query sample, if the value of at least one descriptor falls outside the defined range, the sample is regarded as “out-of-distribution.” The more descriptors that violate the criterion, the more uncertain the prediction is. Instead of using the raw descriptors directly, a dimensionality reduction strategy such as PCA (principal component analysis) is sometimes performed first to reduce the feature space (Carrio et al., 2014). The lower and upper bounds of the acceptable range are usually set to the minimum and maximum values of the descriptor in the training set, but sometimes the 5th and 95th percentiles are used.

Another similarity-based approach considers similarity in the activity space rather than the feature space (Keefer et al., 2013). If the predicted value of a query sample is not consistent with the labels of structurally similar training samples, which indicates that the SAR (structure-activity relationship) landscape is not smooth, the prediction is considered unreliable. In general, these approaches make overly strong assumptions about the distribution of the features (independent variables x) or the labels (dependent variable y). For example, most of them assume that features are independent of each other, and they can only provide binarized or discrete uncertainty estimates (reliable or unreliable), which limits their application.

Unlike the methods mentioned above, another kind of similarity-based approach considers the overall distance between samples and is usually called the distance-to-model (DM) method. Applying the DM method first requires defining the distance between two samples $A$ and $B$, which depends on the format of the features. If the features are Boolean vectors, the Tanimoto similarity (also called the Jaccard index) is often used:

$$ T(A, B) = \frac{\sum_{i} x_{A,i}\, x_{B,i}}{\sum_{i} x_{A,i} + \sum_{i} x_{B,i} - \sum_{i} x_{A,i}\, x_{B,i}} \tag{1} $$

where $x_{A,i}$ is the $i$-th feature value of molecule $A$; the corresponding distance is $1 - T(A, B)$. If $x$ is a continuous vector, the Euclidean distance is often used:

$$ d(A, B) = \sqrt{\sum_{i=1}^{M} \left(x_{A,i} - x_{B,i}\right)^2} \tag{2} $$

where $M$ is the total feature length. Once the distance between samples has been defined, we can further define the distance between a test sample and the training set, which is then taken as the predictive uncertainty. Many strategies can be applied here, for example, the average distance to the k nearest training samples (Sheridan et al., 2004) or the distance to the representative average of the training set (Berenger and Yamanishi, 2019). The threshold for defining the AD can be decided by analyzing the training data distribution (Sahigara et al., 2013).

Instead of computing distances, some methods define an acceptable high-dimensional region and assume that query samples within this region can be reliably predicted. An example is the Convex Hull strategy, which defines the smallest convex region that covers the training points (Jaworska et al., 2005). It can also be regarded as an extension of the Bounding Box method.

Recently, more complex similarity-based approaches have emerged. For example, the SDC score proposed by Liu et al. (Liu et al., 2018; Liu and Wallqvist, 2019) uses the contribution of all training molecules to estimate the reliability of a prediction, with the contribution of each training sample down-weighted exponentially by its distance. The similarity-based approaches above depend heavily on how samples are featurized. However, by engineering raw features, DL models can project samples into a task-specific latent space in which distances can also be treated as an uncertainty metric. Janet et al. tested this idea on two diverse chemical datasets and found that latent space distance outperformed other well-established uncertainty metrics without any additional training cost (Janet et al., 2019). Similarly, Allen et al. proposed NNAS (neural network activation similarity), a kind of latent space distance, to increase prediction confidence in toxicity safety evaluation. They found that NNAS outperformed Tanimoto similarity and RFS (random forest similarity) in similarity searching (Allen et al., 2020).
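Two of the similarity-based ideas in this section, the Bounding Box check and the distance-to-model score based on the k nearest training samples, can be sketched in a few lines. This is a minimal illustration under assumptions: the function names, the min/max bounds, the toy descriptors and the 4-bit fingerprints are all hypothetical, and real projects would typically use a cheminformatics toolkit to compute fingerprints.

```python
def bounding_box_violations(query, training):
    """Count descriptors of `query` outside the training min/max range."""
    count = 0
    for j, value in enumerate(query):
        column = [row[j] for row in training]
        if value < min(column) or value > max(column):
            count += 1
    return count

def tanimoto_similarity(a, b):
    """Tanimoto (Jaccard) similarity of two equal-length Boolean vectors."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 1.0

def knn_distance(query, training, k):
    """Mean Tanimoto distance (1 - similarity) to the k nearest training samples."""
    distances = sorted(1.0 - tanimoto_similarity(query, t) for t in training)
    return sum(distances[:k]) / k

# Bounding Box on two continuous descriptors (hypothetical values).
train_descriptors = [[1.0, 200.0], [2.5, 350.0], [1.8, 280.0]]
n_out = bounding_box_violations([3.0, 300.0], train_descriptors)

# Distance-to-model on toy Boolean fingerprints.
train_fps = [[1, 1, 0, 0], [1, 0, 1, 0], [1, 1, 1, 0]]
near = knn_distance([1, 1, 0, 0], train_fps, k=2)  # resembles the training set
far = knn_distance([0, 0, 0, 1], train_fps, k=2)   # shares no bits with it
```

Here `n_out` counts how many descriptors break the range criterion, and a larger `knn_distance` is read as a less reliable prediction. Turning either score into an in-domain or out-of-domain decision still requires a threshold chosen from the training data distribution.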

Bayesian approaches

The training process of a neural network can be viewed as learning the optimal parameters $w$ of a probabilistic model $p(y \mid x, w)$. Frequentists and Bayesians adopt different strategies for solving this problem, and their differences are visualized in Figure 3. As shown in Figures 3A and 3C, for frequentists the parameters are fixed but unknown quantities that can be estimated by maximum likelihood estimation (MLE). This corresponds to the standard training protocol that minimizes the empirical loss (Nix and Weigend, 1994; Scalia et al., 2020). On the other hand, as shown in Figures 3B and 3D, Bayesians treat parameters as random variables and adopt maximum a posteriori (MAP) estimation or directly give the posterior distribution of the parameters according to Bayes’ theorem. The latter is called the Bayesian neural network (BNN), where model weights and outputs are both distributions instead of deterministic values (Goan and Fookes, 2020). Unlike the standard neural network, a BNN has the advantage of directly capturing the uncertainty of the prediction (Olivier et al., 2021).

To show this briefly, assume that the model parameters follow the prior distribution $p(w)$ (e.g., a normal distribution) and that the model likelihood is $p(Y \mid X, w)$, where $X$ refers to the feature vectors and $Y$ refers to the label vector. The posterior distribution is then obtained by Bayesian inference, Equation (3):

$$ p(w \mid D) = \frac{p(Y \mid X, w)\, p(w)}{p(Y \mid X)} \tag{3} $$

where $D = (X, Y)$ corresponds to the training set as “seen” by the model, and the posterior distribution $p(w \mid D)$ is the joint probability distribution of the model weights learned by (conditioned on) fitting the training set. Once the distribution of model weights is determined, for a query sample $x^*$, its prediction $y^*$, a distribution, can be calculated using Equation (4):

$$ p(y^* \mid x^*, D) = \int p(y^* \mid x^*, w)\, p(w \mid D)\, dw \tag{4} $$

where the final prediction can be understood as a “weighted sum” of the predictions $p(y^* \mid x^*, w)$ for each set of possible model weights $w$, and the probability of $w$ depends on the training set $D$. For regression tasks, because $y^*$ is a random variable following a distribution rather than a deterministic number in the Bayesian neural network, we can define the uncertainty of $y^*$ as its variance, which can be calculated according to Equation (5):

$$ \mathrm{Var}(y^* \mid x^*, D) = \mathbb{E}_{p(w \mid D)}\!\left[\mathrm{Var}(y^* \mid x^*, w)\right] + \mathrm{Var}_{p(w \mid D)}\!\left(\mathbb{E}[y^* \mid x^*, w]\right) \tag{5} $$

Figure 3

Comparison between the traditional neural network and Bayesian neural network

The outputs and parameters of the traditional neural network are deterministic values (A and C), while in the Bayesian neural network they are distributions (B and D).

It can be seen that the total uncertainty is decomposed into two terms: the first is the aleatoric uncertainty and the second is the epistemic uncertainty, both introduced in the “sources of uncertainty in drug discovery” section. Directly using Equation (5) to calculate $\mathrm{Var}(y^* \mid x^*, D)$ (the total uncertainty) faces two problems.

First, the likelihood $p(y \mid x, w)$ must be defined. For regression tasks, mean-variance estimation (MVE) is often used (Nix and Weigend, 1994). In MVE, the output of a neural network (with determined model weights $w$) is defined as a Gaussian distribution, and the task of the neural network is to give the mean and variance of the distribution:

$$ f_w(x) = \left[\mu_w(x),\ \sigma_w^2(x)\right] \tag{6} $$

$$ p(y \mid x, w) = \mathcal{N}\!\left(y;\ \mu_w(x),\ \sigma_w^2(x)\right) \tag{7} $$

where $f_w(x)$ refers to the model output. In practice, the output layer of the neural network is branched into two predictions (a two-dimensional vector), the mean $\mu_w(x)$ and the variance $\sigma_w^2(x)$. Because the variance must be non-negative, its log value is generally predicted instead. In addition to this minimal modification of the output layer, the loss function is changed to the form shown in Equation (9), which is obtained by applying maximum likelihood estimation (equivalent to MAP inference under a flat prior) to the Gaussian probability density function [Equation (8)]:

$$ p(y \mid x, w) = \frac{1}{\sqrt{2\pi \sigma_w^2(x)}} \exp\!\left(-\frac{\left(y - \mu_w(x)\right)^2}{2 \sigma_w^2(x)}\right) \tag{8} $$

where $y$ is the true label of the sample $x$, and

$$ \mathcal{L}(w) = \frac{1}{N} \sum_{i=1}^{N} \left[\frac{1}{2} \log \sigma_w^2(x_i) + \frac{\left(y_i - \mu_w(x_i)\right)^2}{2 \sigma_w^2(x_i)}\right] \tag{9} $$

where $N$ is the number of training samples and $y_i$ is the true label of the $i$-th training sample $x_i$. The constant term $\frac{1}{2}\log 2\pi$ is omitted because it does not depend on $w$. In MVE, labels are assumed to carry Gaussian errors that represent the noise in the labels. PyTorch (Paszke et al., 2019), a popular deep-learning Python library, provides a function (torch.nn.functional.gaussian_nll_loss, version 1.11.0) for conveniently calculating the MVE loss.

The second problem is that the posterior distribution $p(w \mid D)$ cannot be calculated analytically owing to the intractable calculation of the evidence $p(Y \mid X)$.
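Returning briefly to the first problem, the MVE loss referred to as Equation (9) can be illustrated without any deep-learning framework. The sketch below is a plain-Python stand-in rather than the PyTorch function mentioned above; the name `mve_loss`, the convention of passing log-variances, the omission of the constant term, and all numbers are assumptions made for illustration.

```python
import math

def mve_loss(means, log_vars, targets):
    """Mean Gaussian negative log-likelihood; the constant 0.5*log(2*pi) is dropped."""
    total = 0.0
    for mu, log_var, y in zip(means, log_vars, targets):
        var = math.exp(log_var)  # predicting the log keeps the variance positive
        total += 0.5 * log_var + (y - mu) ** 2 / (2.0 * var)
    return total / len(targets)

targets = [1.0, 2.0, 3.0]
means = [1.1, 1.9, 4.0]  # the third prediction is off by 1.0

# Case 1: the model claims a small variance everywhere, including the bad prediction.
confident = mve_loss(means, [math.log(0.01)] * 3, targets)

# Case 2: the model admits a large variance for the bad prediction.
honest = mve_loss(means, [math.log(0.01), math.log(0.01), math.log(1.0)], targets)
```

The loss is lower in the second case: the squared error of the poor prediction is divided by a larger variance, while the log-variance term stops the model from inflating the variance everywhere. This trade-off is what lets the predicted variance serve as an estimate of label noise.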
Some strategies are often used to make an approximation, for example, variational inference (VI) (Blei et al., 2017), which constructs a variational distribution $q_\theta(w)$ to approximate $p(w \mid D)$ by minimizing the Kullback-Leibler divergence between $q_\theta(w)$ and $p(w \mid D)$. VI methods constitute a standard technique in Bayesian modeling. However, their high computational cost still limits their application. Thus, some approximate alternatives have been implemented to circumvent this cost, such as an ensemble that consists of training the same network multiple times with random initialization. Each training run can be regarded as drawing one sample from the true distribution of model weights $p(w \mid D)$. Here, these approximate methods are classified as “ensemble-based approaches” and are detailed in the next section. Once weights have been sampled from $p(w \mid D)$, $\mathrm{Var}(y^* \mid x^*, D)$ can be approximated by Equation (10), as proposed by Kendall and Gal (2017):

$$ \mathrm{Var}(y^* \mid x^*, D) \approx \frac{1}{T} \sum_{t=1}^{T} \sigma_{w_t}^2(x^*) + \frac{1}{T} \sum_{t=1}^{T} \left(\mu_{w_t}(x^*) - \bar{\mu}(x^*)\right)^2, \qquad \bar{\mu}(x^*) = \frac{1}{T} \sum_{t=1}^{T} \mu_{w_t}(x^*) \tag{10} $$

where $\{w_t\}_{t=1}^{T}$ are the sampled model weights, and $\mu_{w_t}(x^*)$ and $\sigma_{w_t}^2(x^*)$ are the mean and variance predicted with the weights $w_t$. Zhang et al. benchmarked this approach in the context of molecular property prediction on six datasets (Zhang and Lee, 2019). Their results showed that the total uncertainty is a better estimate of error than any single source of uncertainty. Scalia et al. drew the same conclusion in a recent benchmarking test of molecular property prediction, again highlighting the importance of considering both sources of uncertainty (Scalia et al., 2020).

For classification problems, the label $y$ can be expressed as:

$$ y \in \{e_1, \dots, e_C\} \tag{11} $$

where $e_c$ is a one-hot encoded vector whose $c$-th element is 1 and whose other elements are zeros, and $C$ is the number of classes. For example, for a typical binary classification problem, $y$ can be either [0, 1] or [1, 0]. The likelihood function, or the probability predicted by the model that the sample belongs to the $c$-th class, is given by:

$$ p(y = e_c \mid x, w) = p_c = \frac{\exp\!\left(f_w^{c}(x)\right)}{\sum_{c'=1}^{C} \exp\!\left(f_w^{c'}(x)\right)} \tag{12} $$

where $f_w^{c}(x)$ is the $c$-th element of the pre-activation model output and the probability vector $p = [p_1, \dots, p_C]$ is the final output. Several different methods exist for UQ in classification settings. Here we introduce two of them.
The first was proposed by Kwon et al. (2020) and aims at calculating the total variance of the prediction, as in the regression setting:

$$ \mathrm{Var}(y^* \mid x^*, D) \approx \frac{1}{T} \sum_{t=1}^{T} \left[\mathrm{diag}(p_t) - p_t p_t^{\top}\right] + \frac{1}{T} \sum_{t=1}^{T} \left(p_t - \bar{p}\right)\left(p_t - \bar{p}\right)^{\top} \tag{13} $$

where $\bar{p} = \frac{1}{T} \sum_{t=1}^{T} p_t$ is the predictive mean and $p_t = p(y^* \mid x^*, w_t)$ is the prediction of a single model whose weights $w_t$ are sampled from $p(w \mid D)$, as in Equation (10). The first term corresponds to the aleatoric uncertainty and the second to the epistemic uncertainty. Ryu et al. applied this method to develop a Bayesian graph convolutional network (GCN) for molecular property prediction (Ryu et al., 2019). They demonstrated that using a Bayesian GCN to quantify prediction uncertainty improves virtual screening accuracy and allows the quality of training data to be evaluated quantitatively. Kim et al. also used this method to develop a Bayesian neural network for protein-ligand interaction prediction, which showed better performance than previous baselines (Kim et al., 2021). Beker et al. applied this method to decompose the total error within predictions of drug-likeness into aleatoric and epistemic components (Beker et al., 2020).

For the second method, instead of the variance, the entropy of $p$ is used to quantify the uncertainty of the probability distribution (Shannon, 1948). However, the entropy of a single model output does not distinguish between aleatoric and epistemic uncertainties. To achieve this, Smith and Gal (2018) proposed that the predictive entropy be taken as the total uncertainty, the expected entropy as the aleatoric uncertainty, and the mutual information as the epistemic uncertainty. Once the ensemble has been obtained, these terms can be approximated as follows:

$$ \begin{aligned} \text{total:} \quad & H[\bar{p}] = -\sum_{c=1}^{C} \bar{p}_c \log \bar{p}_c \\ \text{aleatoric:} \quad & \frac{1}{T} \sum_{t=1}^{T} H[p_t] = -\frac{1}{T} \sum_{t=1}^{T} \sum_{c=1}^{C} p_{t,c} \log p_{t,c} \\ \text{epistemic:} \quad & H[\bar{p}] - \frac{1}{T} \sum_{t=1}^{T} H[p_t] \end{aligned} $$

Yildirim et al. used this method to filter out false positive predictions in the semantic segmentation of particle instances in EM images (Yildirim and Cole, 2021).

Apart from BNN, the Gaussian process (GP) is another classical Bayesian machine-learning approach that provides native uncertainty for its predictions (Williams and Rasmussen, 1996).
The most obvious similarity between GP and BNN is that the predictions of both models are probabilistic and can be used to infer predictive uncertainty or compute empirical confidence intervals. Taking regression as an example, Gaussian process regression (GPR) models the inputs and outputs using Equation (16):

$$ y = f(x) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma_n^2) \tag{16} $$

where $f$ is a latent function and $\varepsilon$ is a noise term, which is typically assumed to be normally distributed with zero mean and noise variance $\sigma_n^2$. Instead of explicitly modeling $f$ with a neural network architecture, as in BNN, in GPR the latent function $f$ is assumed to be drawn from a Gaussian process prior with mean function $m(x)$ and covariance function $k(x, x')$. As in the MVE method [Equation (7)], in GPR the predictive distribution of $y^*$ for a test sample $x^*$ also follows a Gaussian distribution:

$$ p(y^* \mid x^*, D) = \mathcal{N}\!\left(\mu_*,\ \sigma_*^2\right) \tag{17} $$

in which the mean $\mu_*$ and variance $\sigma_*^2$ can be calculated using Equations (18) and (19):

$$ \mu_* = m(x^*) + k_*^{\top} \left(K + \sigma_n^2 I\right)^{-1} \left(Y - m(X)\right) \tag{18} $$

$$ \sigma_*^2 = k(x^*, x^*) - k_*^{\top} \left(K + \sigma_n^2 I\right)^{-1} k_* + \sigma_n^2 \tag{19} $$

where $K_{ij} = k(x_i, x_j)$ is the covariance matrix of the training samples, and $k_* = [k(x_1, x^*), \dots, k(x_N, x^*)]^{\top}$. This process is equivalent to marginalizing over the infinitely many possible latent functions $f$. As in Equation (5), $\sigma_*^2$ here can be taken as the uncertainty of the prediction. As a non-parametric model (the functional form of $f$ is not specified), GP is more flexible than BNN but carries the burden of storing the training data points for computing the covariance matrix (Li et al., 2021). The machine-learning package scikit-learn (Pedregosa et al., 2011) provides a convenient API for building GP models.

The application of GP in computational chemistry and chemoinformatics has been well studied (Deringer et al., 2021). DiFranzo et al. proposed a nearest neighbor Gaussian process model for QSAR modeling and found that the variance of the model output provides calibrated uncertainty estimation (DiFranzo et al., 2020). Musil et al. presented a scheme based on subsampling and sparse GP regression for fast and reliable uncertainty estimation in atomic and molecular property prediction (Musil et al., 2019). Xiang et al. proposed a GP model with a hybrid kernel, GP-MGK, for molecular property prediction (Xiang et al., 2021). They found that GP-MGK outperformed D-MPNN, a kind of graph convolutional neural network, in uncertainty quantification. These examples demonstrate the usefulness of GP in chemical modeling and uncertainty estimation. However, more benchmarking tests are still needed to compare GP with other state-of-the-art deep learning models (Hirschfeld et al., 2020).
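The Gaussian process regression equations discussed in this section can be tried on a one-dimensional toy problem. The following is a minimal sketch under several assumptions that do not come from the text: a zero mean function, an RBF covariance function with fixed length scale, a fixed noise variance, and hypothetical helper names (`rbf`, `solve`, `gp_predict`). It uses only the standard library and is not the scikit-learn API mentioned above, which would normally be preferred in practice.

```python
import math

def rbf(a, b, length=1.0):
    """Squared-exponential covariance function k(a, b)."""
    return math.exp(-((a - b) ** 2) / (2.0 * length ** 2))

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x

def gp_predict(x_train, y_train, x_star, noise_var=0.01):
    """Zero-mean GP regression: predictive mean and variance of y* at x_star."""
    n = len(x_train)
    # K + sigma_n^2 * I
    K = [[rbf(x_train[i], x_train[j]) + (noise_var if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [rbf(xi, x_star) for xi in x_train]
    alpha = solve(K, y_train)  # (K + sigma_n^2 I)^{-1} Y
    beta = solve(K, k_star)    # (K + sigma_n^2 I)^{-1} k_*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = (rbf(x_star, x_star)
           - sum(k * b for k, b in zip(k_star, beta))
           + noise_var)
    return mean, var

x_train = [0.0, 1.0, 2.0]  # hypothetical training inputs
y_train = [0.0, 0.8, 0.9]  # hypothetical training labels
mean_in, var_in = gp_predict(x_train, y_train, 1.0)    # at a training point
mean_out, var_out = gp_predict(x_train, y_train, 6.0)  # far from all training data
```

Far from the training data the covariance vector `k_star` is almost zero, so the predictive variance returns to the prior variance plus noise and the mean returns to the prior mean. This is the behavior described for epistemic uncertainty in Figure 2B.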

Ensemble-based approaches

It has long been observed that ensemble learning improves predictive performance (Dietterich, 2000). Beyond this, ensemble learning can also be used for UQ (Lakshminarayanan et al., 2017). Ensemble learning aims at constructing multiple similar but different base learners. In general, the predictions of the base learners are aggregated into the final prediction (e.g., by the mean or median), and their variance is taken as an estimate of epistemic uncertainty. Here, we take random forest (for regression) as an example to illustrate the use of ensemble-based UQ approaches in practice. For a query sample $x^*$, the prediction $\hat{y}(x^*)$ is the average of the predictions of all decision trees (base learners) $h_t$, and the uncertainty of this sample is given by the variance of the predictions of all decision trees:

$$ \hat{y}(x^*) = \frac{1}{T} \sum_{t=1}^{T} h_t(x^*) \tag{20} $$

$$ \sigma^2(x^*) = \frac{1}{T} \sum_{t=1}^{T} \left(h_t(x^*) - \hat{y}(x^*)\right)^2 \tag{21} $$

where $T$ is the number of decision trees $h_t$. Different base learners tend to output similar prediction values when the inputs are similar to the observed training data, because the weights of each base learner, even if different, are optimized for those data. In contrast, as inputs become less similar to the training data, the outputs of each base learner become more sensitive to the specifics of the suboptimal solution it has reached, which leads to a higher variance (Scalia et al., 2020). Given this, diversity among the base learners should be promoted to improve uncertainty estimation. The general idea for promoting diversity is to introduce randomness into the training process, and the commonly used methods can be categorized into four styles: data, features, outputs, and weights perturbations. For clarity, they are visualized in Figure 4. These perturbation methods and the associated UQ methods are reviewed in the following sections.
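How the outputs of an ensemble are turned into uncertainty estimates can be sketched as follows. The code assumes that the predictions of the base learners for a single query sample are already available. The function names and numbers are hypothetical, the regression part follows the mean-and-variance recipe described above (with the optional per-learner variances playing the role of the aleatoric term in Equation (10)), and the classification part follows the entropy-based split attributed to Smith and Gal in the Bayesian section.

```python
import math

def regression_uncertainty(means, variances=None):
    """Combine T base-learner outputs for one query sample.

    means[t] is learner t's prediction; variances[t] is its predicted noise
    variance (omit it for point predictors such as decision trees).
    """
    T = len(means)
    prediction = sum(means) / T
    epistemic = sum((m - prediction) ** 2 for m in means) / T
    aleatoric = sum(variances) / T if variances is not None else 0.0
    return prediction, aleatoric, epistemic

def entropy(p):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def classification_uncertainty(probabilities):
    """Entropy-based split for T class-probability vectors of one sample."""
    T = len(probabilities)
    n_classes = len(probabilities[0])
    mean_p = [sum(p[c] for p in probabilities) / T for c in range(n_classes)]
    total = entropy(mean_p)                                  # predictive entropy
    aleatoric = sum(entropy(p) for p in probabilities) / T   # expected entropy
    return total, aleatoric, total - aleatoric               # last: mutual information

# Regression: four hypothetical tree predictions for one compound.
tree_predictions = [6.8, 7.1, 7.0, 6.7]
pred, alea, epi = regression_uncertainty(tree_predictions)

# Classification: base learners that agree vs. base learners that disagree.
agree = classification_uncertainty([[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]])
disagree = classification_uncertainty([[0.95, 0.05], [0.05, 0.95]])
```

When all base learners return the same probability vector, the mutual information is zero and the remaining entropy is attributed to the data. When confident learners contradict each other, most of the total entropy is epistemic.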

Figure 4

Illustration of ensemble-based UQ methods

(A) Data perturbation. Sub-models are trained based on different subsets of the original training set.

(B) Features perturbation. Sub-models are trained based on different subsets of the original sample features.

(C) Outputs perturbation. The output of the model is no longer a deterministic value, but a difference.

(D) Weights perturbation. The sub-models are generated by keeping dropout open in the prediction process.

Data perturbation

Dataset perturbation is usually based on sampling. Given an initial dataset, different subsets can be sampled and then used to train different base learners to increase diversity (Figure 4A). For example, bagging (bootstrap aggregating) is a popular technique where base learners are trained on different bootstrap samples of the original training set (Scalia et al., 2020). Dataset perturbation is highly effective with base learners that are sensitive to training data, such as neural networks, but it may also impair the predictive performance of neural networks because each base learner sees only part of the training data.
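As a rough sketch of data perturbation, the following Python trains one least-squares line per bootstrap sample and uses the spread of their predictions as the uncertainty. The toy data, the linear base learner, and all function names are assumptions made for illustration; neural networks or decision trees would replace `fit_line` in practice.

```python
import random
from statistics import fmean, pvariance

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (closed form)."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0.0:  # degenerate resample: fall back to a constant model
        return 0.0, my
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def bootstrap_ensemble(xs, ys, n_models, rng):
    """Train one base learner per bootstrap sample (drawn with replacement)."""
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def predict_with_uncertainty(models, x):
    """Mean and variance of the base learners' predictions at x."""
    preds = [slope * x + intercept for slope, intercept in models]
    return fmean(preds), pvariance(preds)

rng = random.Random(0)
train_x = [i * 0.1 for i in range(31)]                       # x in [0, 3]
train_y = [1.5 * x + rng.gauss(0.0, 0.3) for x in train_x]   # noisy toy labels
models = bootstrap_ensemble(train_x, train_y, n_models=50, rng=rng)

mean_in, var_in = predict_with_uncertainty(models, 1.5)      # inside training range
mean_out, var_out = predict_with_uncertainty(models, 10.0)   # far outside it
print("inside :", mean_in, var_in)
print("outside:", mean_out, var_out)
```

The query far outside the training range receives the larger variance, because small differences between the fitted slopes are amplified with distance from the data.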

Features perturbation

For ML models, training samples are always represented by a set of attributes (e.g., molecular descriptors or molecular fingerprints) that could be thought of as a feature space, and different feature subspaces could provide various perspectives on samples. As shown in Figure 4B, features perturbation aims at describing samples from different feature subspaces to increase the diversity of the trained base learners. One of the most representative models is random forest (Saxe et al., 2021). The diversity of the base learners in the RF algorithm not only derives from data perturbation (bootstrap sampling), but also from features perturbation. Accordingly, the generalization ability of the final model could be improved and the variance of the predictions of these base learners could be regarded as predictive uncertainty (Sheridan, 2012). Some data augmentation methods used in deep learning also share similarities with features perturbation. For example, considering that SMILES (Simplified molecular input line entry system) of a molecule are not unique, Kimber et al. used different SMILES to represent the same molecule for data augmentation, where SMILES are the input format of their model (Kimber et al., 2021). Similar to features perturbation, different SMILES can provide different perspectives on the same molecule. Based on this data augmentation method, they found that in addition to the benefit in the model performance, the variance of the predictions of the SMILES corresponding to the same molecule could also be taken as an estimate of uncertainty.
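A minimal sketch of features perturbation follows, assuming hypothetical 4-bit fingerprints and a 1-nearest-neighbour base learner. For determinism it enumerates every 2-feature subspace rather than sampling subspaces at random as a random forest would; that substitution is an assumption of this example.

```python
from itertools import combinations
from statistics import fmean, pvariance

def nearest_label(train_x, train_y, query, feat_idx):
    """1-nearest-neighbour prediction using only the features in feat_idx."""
    def dist(row):
        return sum((row[j] - query[j]) ** 2 for j in feat_idx)
    best = min(range(len(train_x)), key=lambda i: dist(train_x[i]))
    return train_y[best]

def subspace_predict(train_x, train_y, query, n_feats):
    """Mean and variance over learners that each see one feature subspace."""
    preds = [
        nearest_label(train_x, train_y, query, feat_idx)
        for feat_idx in combinations(range(len(query)), n_feats)
    ]
    return fmean(preds), pvariance(preds)

# Hypothetical fingerprints and activity labels.
train_x = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0]]
train_y = [6.0, 4.0, 5.0]

mean_seen, var_seen = subspace_predict(train_x, train_y, [1, 1, 0, 0], n_feats=2)
mean_novel, var_novel = subspace_predict(train_x, train_y, [1, 0, 0, 1], n_feats=2)
print("seen  :", mean_seen, var_seen)
print("novel :", mean_novel, var_novel)
```

A query identical to a training sample gets the same answer from every subspace, while a novel query is matched to different neighbours depending on which features are visible, so its predictions disagree.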

Outputs perturbation

Outputs perturbation (Figure 4C) enhances diversity by replacing the original task with other related tasks. For example, DeltaDelta, a pairwise difference regression model proposed by Jimenez-Luna et al., replaces the absolute activity (pIC50) of a ligand with the activity difference (ΔpIC50) between a pair of ligands as output (Jimenez-Luna et al., 2019). For DeltaDelta, a predicted pIC50 value of a new ligand could be recovered by first predicting the ΔpIC50 between the new ligand and any previously seen (pIC50 known) reference ligand, and then adding back in the pIC50 value of the reference molecule. By conducting this prediction procedure for all reference ligands and the new ligand, multiple predicted values of its pIC50 could be obtained and the variance of these predicted values could be regarded as an estimate of the uncertainty. Tynes et al. transferred this idea to molecular property prediction and observed similar results (Tynes et al., 2021).
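The recovery procedure described above can be sketched as follows. The `delta_model` here is a hypothetical stand-in (a deliberately imperfect linear function of a single made-up descriptor), not the DeltaDelta network; only the aggregation over reference ligands is the point of the example.

```python
from statistics import fmean, pvariance

def predict_from_references(delta_model, query, references):
    """Recover one pIC50 estimate for `query` per reference ligand.

    references: list of (ligand, known_pIC50) pairs.
    delta_model(a, b): predicted pIC50(a) - pIC50(b).
    Returns the mean estimate and the variance across references.
    """
    estimates = [delta_model(query, ref) + ref_pic50 for ref, ref_pic50 in references]
    return fmean(estimates), pvariance(estimates)

# Hypothetical setup: each ligand is a single descriptor value, and the
# difference model underestimates true differences by 20%.
def toy_delta_model(a, b):
    return 0.8 * (a - b)

references = [(5.0, 5.0), (6.0, 6.0), (7.0, 7.0)]  # (descriptor, known pIC50)

mean_pred, var_pred = predict_from_references(toy_delta_model, 6.0, references)
print(mean_pred, var_pred)
```

Each reference yields a slightly different estimate because the difference model is imperfect, and the spread of those estimates serves as the uncertainty.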

Weights perturbation

Compared with other perturbation methods, weights perturbation methods force the base learners to obtain different weights more directly. Two representative examples are Deep Ensemble (Lakshminarayanan et al., 2017) and MC-dropout (Gal and Ghahramani, 2016; Kendall and Gal, 2017). Deep Ensemble trains multiple base learners of the same structure with random initialization of the model weights. Because the optimization problem is non-convex and the optimization strategies employed are suboptimal, the base learners readily reach different solutions. MC-dropout consists of training a network with dropout before every layer and then, in the inference process, keeping dropout active to sample multiple outputs with different random masks (Figure 4D). Owing to their model-agnostic nature and ease of implementation, weights perturbation methods can be considered state-of-the-art for epistemic UQ in neural networks (Soleimany et al., 2021).
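The inference step of MC-dropout can be sketched without a deep learning framework. The network below is a hypothetical, already trained one-hidden-layer model with made-up weights, and dropout is applied only to the hidden layer (a simplification of "dropout before every layer"); the function names and default values are assumptions.

```python
import random
from statistics import fmean, pvariance

# Hypothetical weights of a trained network: 1 input, 4 ReLU units, 1 output.
W1 = [0.9, -0.4, 0.7, 0.2]
B1 = [0.0, 0.1, -0.2, 0.3]
W2 = [0.5, -0.3, 0.8, 0.1]
B2 = 0.05

def forward(x, rng, p_drop):
    """One stochastic forward pass with dropout kept active on the hidden layer."""
    keep = 1.0 - p_drop
    out = B2
    for w1, b1, w2 in zip(W1, B1, W2):
        h = max(0.0, w1 * x + b1)      # ReLU hidden unit
        if rng.random() < keep:        # unit survives this random mask
            out += w2 * h / keep       # inverted-dropout scaling
    return out

def mc_dropout_predict(x, n_samples=200, p_drop=0.2, seed=0):
    """Mean and variance over n_samples stochastic forward passes."""
    rng = random.Random(seed)
    samples = [forward(x, rng, p_drop) for _ in range(n_samples)]
    return fmean(samples), pvariance(samples)

mean_mc, var_mc = mc_dropout_predict(2.0)                  # dropout active
mean_det, var_det = mc_dropout_predict(2.0, p_drop=0.0)    # dropout disabled
print("MC-dropout   :", mean_mc, var_mc)
print("deterministic:", mean_det, var_det)
```

With dropout disabled every pass is identical and the variance collapses to zero, which is why a conventional network gives no uncertainty estimate. A Deep Ensemble would instead average over several independently initialized and trained copies of the network.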

Application of uncertainty quantification in drug discovery

Estimation of model maximum achievable accuracy

The performance of in silico models depends on the quality of the training data (Saxe et al., 2021), and in most drug discovery projects, the labels of training data are defined by experimental measurements with inherent variability (Kolmar and Grulke, 2021). As a result, the intrinsic label uncertainty or noise in the training data determines the maximum achievable accuracy (MAA) of models (Kramer et al., 2012). Estimating the MAA of models based on the currently available data is highly instructive for follow-up machine learning studies. For example, if the accuracy of a model has approached the possible MAA, we should pay more attention to expanding the dataset or improving the quality of the training data rather than considering more sophisticated model architectures. Given the close relationship between the label uncertainty of training data and the MAA of models described above, the problem of how to estimate the MAA of a model can be divided into two sub-problems: (1) how to estimate the label uncertainty in the currently available data, and (2) how to quantify the relationship between the label uncertainty and the MAA. A previous work by Kramer et al. provided the paradigm for the first sub-problem (Kramer et al., 2012). They first extracted all the high-quality Ki data from the ChEMBL database (Gaulton et al., 2012) through a series of data filtering steps. After that, they analyzed the differences between the published Ki measurements of identical protein-ligand systems to estimate the experimental error in the Ki data. Their experimental (or label) uncertainty estimation yielded a mean error of 0.44 pKi units with a standard deviation of 0.54 pKi units, which means that if the average error of a model based on heterogeneous sources of data (i.e., various laboratories, assay conditions, and assay methods) is less than 0.44 pKi units, it is very likely that the model is overtrained. 
This work inspired a series of similar follow-up studies, such as quantitative estimation of label uncertainty in IC50 data (Kalliokoski et al., 2013) and cytotoxicity data sets (Cortes-Ciriano and Bender, 2016). For the second sub-problem, several studies have artificially added simulated noise (usually sampled from normal distributions with different variances) to the labels of a dataset to study the correlation between the label uncertainty of modeling data and model performance (Kolmar and Grulke, 2021; Sheridan et al., 2020). In this way, the originally unknown data noise is turned into a controllable variable with a known value. Kolmar et al. added 15 levels of simulated Gaussian distributed random error to 8 different QSAR datasets, and systematically evaluated the impact of random errors in the datasets on model performance using 5 different algorithms (Kolmar and Grulke, 2021). They found that model performance did deteriorate with the introduction of label noise, and that different kinds of machine learning models show varying degrees of robustness to noise. In addition to directly estimating the average error of data, another strategy to infer the MAA of models is uncertainty quantification. Specifically, in the Bayesian framework, total uncertainty can be divided into aleatoric and epistemic uncertainty according to their sources. The former is the result of irreducible and inherent data noise. The latter is caused by the insufficiency of knowledge provided by the training set. A more detailed description of them has been provided in the “sources of uncertainty in drug discovery” section. Therefore, the proportion of predicted aleatoric uncertainty in the total predicted uncertainty can be used to estimate whether a model has reached the possible MAA. Beker et al. systematically evaluated the performance of various AI models on the prediction of molecular drug-likeness using different types of molecular descriptors (Beker et al., 2020), where Deep Ensemble was used for uncertainty quantification. Based on the result that the total uncertainty was comparable with its aleatoric contribution, they inferred that the classification accuracy reported in their work (0.93) is probably the upper limit achievable with the current collection of known drugs.
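The link between label noise and the maximum achievable accuracy can be illustrated with a small simulation in the spirit of the noise-injection studies above. It is a sketch under simplifying assumptions: labels are synthetic, the noise is Gaussian with a known standard deviation, and the "model" is an oracle that predicts the noise-free value exactly. The noise levels are arbitrary illustrative values, not the ones used in the cited studies.

```python
import math
import random

def rmse(pred, obs):
    """Root-mean-square error between predictions and observed labels."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(pred))

def simulate_noise_floor(noise_sd, n=20000, seed=0):
    """RMSE of a perfect model when evaluated against noisy measured labels."""
    rng = random.Random(seed)
    true_vals = [rng.uniform(4.0, 10.0) for _ in range(n)]      # synthetic activities
    measured = [t + rng.gauss(0.0, noise_sd) for t in true_vals]
    return rmse(true_vals, measured)  # oracle predictions equal the true values

for sd in (0.0, 0.25, 0.5, 1.0):
    print(f"label noise sd = {sd:.2f} -> oracle RMSE = {simulate_noise_floor(sd):.3f}")
```

Even the oracle cannot score an RMSE below the label noise, so a model that reports a lower error on noisy data is more likely fitting the noise than outperforming the measurements.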

Active learning

Owing to the time- and resource-intensive nature of biological and chemical experiments, how to generate new data to improve model performance more efficiently is a key problem in drug discovery (Yu et al., 2021). To address this issue, active learning (AL), an uncertainty-guided algorithm, has begun to show promise and has increasingly been used (Ding et al., 2021; Gong et al., 2021; Jansen et al., 2019; Yang et al., 2021). In AL, a model is typically initialized with a limited training set (e.g., currently available samples). Then, batches of unlabeled samples are iteratively selected based on a pre-defined query strategy (also referred to as a selection function), labeled through associated experiments, and gradually added to the training set. The model is subsequently retrained using this expanded training set, with the expectation of improved predictions on a held-out test set. The query strategy, one of the most important components of AL, is the sampling method that decides which samples should be selected and labeled in each iteration. Depending on the query strategy used, AL can be divided into three categories: exploration-oriented AL, exploitation-oriented AL, and hybrid AL (Ren et al., 2020). Exploration-oriented AL aims to select samples with the greatest predictive uncertainty. These samples may possess novel structures relative to their counterparts in the original training set. As a result, the AD of the retrained model can be enlarged effectively owing to the introduction of novel SAR. For example, Ding et al. explored the effectiveness of four UQ methods in exploration-oriented AL through a case study on the plasma exposure of orally administered drugs, and they found that the query strategy based on entropy is the most sample-efficient strategy (Ding et al., 2021). 
In addition, through complete experimental verification, their work highlighted the effectiveness of exploration-oriented AL in expanding the AD of models and guiding experiment design. Instead of selecting samples based on uncertainty, exploitation-oriented AL provides a framework to discover high-performing compounds (e.g., those with more favorable molecular properties) from a large search space by selecting the unlabeled samples with the highest scores in the iterative process. A typical application scenario of exploitation-oriented AL is structure-based virtual screening (VS) (Neves et al., 2018). As virtual libraries continue to grow [e.g., ZINC (Sterling and Irwin, 2015) now contains roughly 1 billion molecules], the computational resources necessary to conduct exhaustive virtual screening campaigns on these libraries are inaccessible to many academic researchers. Given this, Graff et al. combined the AL algorithm with a QSAR model that predicts the docking scores of molecules, which could enrich most of the molecules with high docking scores when only a few molecules were docked (Graff et al., 2021). However, they found that the chemical diversity of the molecules enriched by the QSAR model with purely exploitation-oriented AL is extremely low. To increase the chemical diversity, they employed a hybrid AL query strategy that incorporates both predicted docking scores and uncertainties to guide sample selection in the iterative process, which highlights the distinctive role of UQ in AL. Because of their flexibility in adjusting the exploration-exploitation trade-off, hybrid AL query strategies (e.g., upper confidence bound) have gradually become the most widely used sampling methods in AL.
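The three families of query strategy differ only in how candidates are ranked, which a short Python sketch can make concrete. The candidate pool, the `beta` weight, and the function name are hypothetical; the hybrid rule shown is the upper confidence bound (predicted score plus a multiple of the predicted standard deviation), assuming higher scores are better.

```python
def select_batch(candidates, batch_size, strategy="hybrid", beta=1.0):
    """Pick the next batch of samples to label.

    candidates: list of (name, predicted_score, predicted_std) tuples.
    """
    if strategy == "exploration":      # most uncertain first
        key = lambda c: c[2]
    elif strategy == "exploitation":   # highest predicted score first
        key = lambda c: c[1]
    elif strategy == "hybrid":         # upper confidence bound
        key = lambda c: c[1] + beta * c[2]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    ranked = sorted(candidates, key=key, reverse=True)
    return [name for name, _, _ in ranked[:batch_size]]

# Hypothetical unlabeled pool: (name, predicted score, predicted std).
pool = [("A", 8.0, 0.1), ("B", 6.0, 1.5), ("C", 7.5, 0.9), ("D", 5.0, 0.2)]

print(select_batch(pool, 2, "exploration"))   # ['B', 'C']
print(select_batch(pool, 2, "exploitation"))  # ['A', 'C']
print(select_batch(pool, 2, "hybrid"))        # ['C', 'A']
```

Raising `beta` shifts the hybrid strategy toward exploration, and setting it to zero recovers pure exploitation.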

Virtual screening

High-throughput virtual screening has emerged as an important approach to identifying hit compounds from large chemical libraries (Shoichet, 2004). Among different types of VS strategies, DL-based VS has shown a promising hit rate and high throughput (Neves et al., 2018). In a typical workflow of DL-based VS, the drug-like compounds from a library are scored by a DL model, and the top-scored ones are selected for further experimental verification. However, most commonly used chemical libraries cover extensive chemical space, much of which contains no compounds with well-studied structures. This may cause a model to give overconfident predictions, which accounts for the limited enrichment ability of conventional DL-based VS models. Incorporating UQ into the selection process to ensure the robustness of predictions is an intuitive way to deal with this problem. For example, if the DL model is trained to predict the pIC50 value (denoted as $\hat{y}$) and the corresponding uncertainty (denoted as $\hat{\sigma}$), Equation (22) can be used to prioritize compounds instead of directly using the descending order of $\hat{y}$:

$$a = \hat{y} - \lambda\,\hat{\sigma} \tag{22}$$

where $\lambda$ is a user-defined parameter deciding the extent of the uncertainty penalty, and $a$ is the acquisition score. It should be noted that the common practice is to use pIC50 values as modeling targets instead of the raw IC50 values. Compared with that of IC50 values, the distribution of pIC50 values in biological datasets is closer to a Gaussian distribution; thus, the conversion from IC50 to pIC50 can be taken as a kind of label scaling that makes the prediction of both target values and uncertainties more accurate for machine-learning models. Hie et al. validated the effectiveness of this strategy on the task of modeling compound-kinase interactions (Hie et al., 2020). In this study, GP was used to conduct uncertainty quantification for model prediction. 
Compared with the predictions without uncertainty, they found that predictions with UQ can prioritize interactions with lower Kd, while ignoring uncertainty leads to more false positives. A retrospective virtual screening study by Soleimany et al. (2021) also showed that filtering the results based on estimated uncertainty can increase the hit rate. In PIGNet, a DL-based drug-target interaction prediction model, MC-dropout is used to quantify the uncertainty and filter unreliable positive predictions (Moon et al., 2022). Besides considering the uncertainty explicitly, as in the acquisition score above, some studies proposed that constructing the VS model using a BNN framework to eliminate the model uncertainty during prediction can also improve the accuracy of the VS model (Kim et al., 2021; Ryu et al., 2019).
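The uncertainty-penalized ranking described in this section can be sketched as follows. The compound names, predicted pIC50 values, and uncertainties are made up, and `penalty` is an assumed name for the user-defined weight on the uncertainty term.

```python
def acquisition_score(pred_pic50, sigma, penalty):
    """Predicted pIC50 minus a user-weighted uncertainty penalty."""
    return pred_pic50 - penalty * sigma

def prioritize(compounds, penalty=1.0):
    """Rank compounds by descending acquisition score.

    compounds: list of (name, predicted_pIC50, predicted_uncertainty).
    """
    ranked = sorted(
        compounds,
        key=lambda c: acquisition_score(c[1], c[2], penalty),
        reverse=True,
    )
    return [name for name, _, _ in ranked]

# Hypothetical model outputs for three library compounds.
library = [("cpd-1", 8.2, 1.6), ("cpd-2", 7.6, 0.3), ("cpd-3", 6.9, 0.2)]

print(prioritize(library, penalty=0.0))  # ['cpd-1', 'cpd-2', 'cpd-3']
print(prioritize(library, penalty=1.0))  # ['cpd-2', 'cpd-3', 'cpd-1']
```

With no penalty the ranking follows the raw predictions, while a penalty of one demotes the highly scored but very uncertain `cpd-1` to the bottom of the list.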

Improving model accuracy and robustness

Most strategies we have introduced so far treat UQ as an independent module in the workflow of model establishment. An important reason is that we hope to balance model accuracy against model explanation: obtaining model explanation at the expense of accuracy is undesirable. However, recent studies have shown that building models with the consideration of uncertainty may have the beneficial side effect of further improving model accuracy. These kinds of models are called uncertainty-aware models. A typical example is MVE, which has been introduced in Section 3.2. By changing the loss function, MVE is able to capture the aleatoric uncertainty inherent in data under a heteroscedastic assumption. This means that for data regions with high noise, the model can assign large uncertainty instead of overfitting them. Kwon et al. compared the MVE loss function with the traditional mean squared error (MSE) loss in the task of reaction yield prediction (Kwon et al., 2022). They found that MVE loss slightly outperformed MSE loss regarding model prediction performance. Previously, we also observed a similar phenomenon in a work on building a hybrid uncertainty quantification method (Wang et al., 2021). For regression problems, well-calibrated uncertainty can be treated as the variance of the error; thus, there is an intuitive way to combine predictions and uncertainties into a more informative format, for example, the confidence interval. However, for classification problems, it is not easy to integrate these two parts. To this end, it is essential to build an uncertainty-aware classification model architecture that can provide well-calibrated probabilities and avoid giving overconfident predictions for out-of-distribution samples. Han et al. recently proposed GNN-SNGP, which can reduce overconfident misprediction by incorporating a Gaussian process and spectral normalization into the model architecture (Han et al., 2021). 
Results on CardioTox, a cardiotoxicity dataset with a significant distribution shift, showed that GNN-SNGP can improve model accuracy and provide well-calibrated predictions. Mervin et al. presented a novel protein-ligand interaction classifier using Probabilistic Random Forest (PRF). In PRF, the original bioactivity value is converted to a probability (e.g., 0.63) that serves as the label, using the cumulative distribution function of a normal distribution to express how likely the compound is to bind to a target. In this way, the labels are treated as probability distributions rather than deterministic values, and the aleatoric uncertainty of the bioactivity labels is introduced into model construction. Bioactivity prediction benchmarking tests showed that PRF outperformed traditional random forest regarding several common classification evaluation metrics, such as F1-score and balanced accuracy (Mervin et al., 2021b).
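The label conversion used by PRF can be illustrated with the normal cumulative distribution function from the Python standard library. This is a sketch of the idea rather than the authors' implementation: the activity threshold of 6.0 and the experimental standard deviation of 0.5 are assumed values chosen for the example.

```python
import math

def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def soft_activity_label(value, threshold, sd):
    """Probability that the true activity exceeds `threshold`, given a measured
    `value` whose experimental error is assumed normal with standard deviation `sd`."""
    return normal_cdf((value - threshold) / sd)

# Assumed activity threshold (pXC50 = 6.0) and experimental error (0.5 log units).
for measured in (5.0, 5.9, 6.0, 6.1, 7.0):
    label = soft_activity_label(measured, threshold=6.0, sd=0.5)
    print(f"measured {measured:.1f} -> probability of being active {label:.3f}")
```

Measurements far from the threshold map to labels close to 0 or 1, while borderline measurements map to labels near 0.5, so the classifier is no longer forced to treat them as confidently active or inactive.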

Conclusion and perspective

In this review article, the background and sources of uncertainty are introduced first. Then three kinds of uncertainty quantification methods with different philosophical reasoning and four typical application scenarios where UQ is indispensable are explored in detail. We hope this content will be helpful and enlightening to readers who are new to this field. Current UQ also faces technical challenges. There is no consensus on optimal UQ methods. The most appropriate UQ method differs between downstream tasks and task scenarios. Many UQ methods are not readily usable and need to be tailored to each application scenario. Thus, benchmarking datasets with different degrees of domain shift are urgently needed for a fair and comprehensive comparison between different UQ methods. Different ML model architectures should also be benchmarked for the UQ methods that serve as an independent module, which will enable users to choose UQ methods more conveniently according to the specific model architecture used in their projects. In the development process, uncertainty-aware models should be compared with conventional deep learning models without uncertainty measurements regarding accuracy and robustness to explore the potential benefits. In addition, many UQ studies focus on theoretical proofs while ignoring practical considerations, which are among the aspects that matter most to users. Therefore, it is highly recommended that subsequent UQ studies summarize the differences from conventional ML models in the deployment process, and demonstrate the practicability of the proposed UQ methods with some applied case studies (e.g., virtual screening), as Soleimany et al. did in their work (Soleimany et al., 2021). Moreover, some UQ methods do not differentiate between aleatoric and epistemic uncertainty, which play different roles in the uncertainty domain. 
For example, aleatoric uncertainty can be used to infer the maximum achievable accuracy of a model, while epistemic uncertainty can guide sample selection in an AL setting. Thus, UQ methods that mix up these two types of uncertainty are less ideal and their applications are limited. Finally, most UQ methods do not show evident calibration ability, especially for out-of-domain samples. Considering that this ability is vital for inferring the label range of test samples, more emphasis should be placed on improving calibration ability when developing novel UQ methods. According to the above discussion, an ideal UQ method should have the following properties: (1) a solid theoretical foundation or a reasonable assumption, (2) ease of deployment, (3) disentanglement of aleatoric from epistemic uncertainty, (4) improvement of model accuracy, (5) calibration ability, and (6) a low computational burden, although full compliance with these requirements may be difficult to achieve. Overall, we still have a long way to go in terms of UQ before AI can play a more substantial role in decision making at different stages of drug development.

57 in total

1. Three useful dimensions for domain applicability in QSAR models using random forest.

Authors: Robert P Sheridan
Journal: J Chem Inf Model Date: 2012-03-09 Impact factor: 4.956

2. The Relative Importance of Domain Applicability Metrics for Estimating Prediction Errors in QSAR Varies with Training Set Diversity.

Authors: Robert P Sheridan
Journal: J Chem Inf Model Date: 2015-06-04 Impact factor: 4.956

3. Biased Complement Diversity Selection for Effective Exploration of Chemical Space in Hit-Finding Campaigns.

Authors: Johanna M Jansen; Gianfranco De Pascale; Susan Fong; Mika Lindvall; Heinz E Moser; Keith Pfister; Bob Warne; Charles Wartchow
Journal: J Chem Inf Model Date: 2019-04-03 Impact factor: 4.956

4. Molecular Similarity-Based Domain Applicability Metric Efficiently Identifies Out-of-Domain Compounds.

Authors: Ruifeng Liu; Anders Wallqvist
Journal: J Chem Inf Model Date: 2018-11-19 Impact factor: 4.956

5. GGL-Tox: Geometric Graph Learning for Toxicity Prediction.

Authors: Jian Jiang; Rui Wang; Guo-Wei Wei
Journal: J Chem Inf Model Date: 2021-03-15 Impact factor: 4.956

6. Comparison of Scaling Methods to Obtain Calibrated Probabilities of Activity for Protein-Ligand Predictions.

Authors: Lewis H Mervin; Avid M Afzal; Ola Engkvist; Andreas Bender
Journal: J Chem Inf Model Date: 2020-09-21 Impact factor: 4.956

7. How Consistent are Publicly Reported Cytotoxicity Data? Large-Scale Statistical Analysis of the Concordance of Public Independent Cytotoxicity Measurements.

Authors: Isidro Cortés-Ciriano; Andreas Bender
Journal: ChemMedChem Date: 2015-11-06 Impact factor: 3.466